Dataset

Data loading

##### Data loading #####
library(tidyverse) # includes the pipe operator
library(dplyr)     # already attached by tidyverse; kept for clarity
customer <- read_csv("olist_data/olist_customers_dataset.csv")
geolocation <- read_csv("olist_data/olist_geolocation_dataset.csv")
orders <- read_csv("olist_data/olist_orders_dataset.csv")
order_items <- read_csv("olist_data/olist_order_items_dataset.csv")
order_payments <- read_csv("olist_data/olist_order_payments_dataset.csv")
order_reviews <- read_csv("olist_data/olist_order_reviews_dataset.csv")
products <- read_csv("olist_data/olist_products_dataset.csv")
sellers <- read_csv("olist_data/olist_sellers_dataset.csv")
category <- read_csv("olist_data/product_category_name_translation.csv")
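Since all nine files live in one folder and share a naming pattern, the block above could also be written as a single purrr pass. A sketch with toy files (the olist paths stay as written above; this only illustrates the pattern):

```r
library(tidyverse)

# Toy stand-ins for the olist files, written to a temp folder
dir <- tempdir()
write_csv(tibble(a = 1:2), file.path(dir, "x_dataset.csv"))
write_csv(tibble(b = 3:4), file.path(dir, "y_dataset.csv"))

# Read every *_dataset.csv and name the list elements after the files
paths <- list.files(dir, pattern = "_dataset\\.csv$", full.names = TRUE)
tables <- set_names(map(paths, read_csv),
                    str_remove(basename(paths), "_dataset\\.csv$"))
names(tables)  # "x" "y"
```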

Description of variables in each dataset

1.customers

Each column of the different csv files is described below.
The customers dataset contains the following features:

customer_id: key to the orders dataset. Each order has a unique customer_id.
customer_unique_id: unique identifier of a customer.
customer_zip_code_prefix: zip code of the customer's location.
customer_city: customer city name.
customer_state: customer state.

2.geolocation

The geolocation dataset contains the following features:

geolocation_zip_code_prefix: zip code of the location.
geolocation_lat: latitude.
geolocation_lng: longitude.
geolocation_city: city name.
geolocation_state: state.

3.order_items

The order_items dataset contains the following features:

order_id: order unique identifier.
order_item_id: sequential number identifying the items included in the same order.
product_id: product unique identifier.
seller_id: seller unique identifier.
shipping_limit_date: the seller's shipping deadline for handing the order over to the logistics partner.
price: item price.
freight_value: item freight value (if an order has more than one item, the freight value is split between items).

4.order_payments

The order_payments dataset contains the following features:

order_id: unique identifier of an order.
payment_sequential: a customer may pay for an order with more than one payment method; if so, a sequence is created to accommodate all payments.
payment_type: method of payment chosen by the customer.
payment_installments: number of installments chosen by the customer.
payment_value: transaction value.

5.order_reviews

The order_reviews dataset contains the following features:

review_id: unique review identifier.
order_id: unique order identifier.
review_score: score from 1 to 5 given by the customer in a satisfaction survey.
review_comment_title: title of the review left by the customer, in Portuguese.
review_comment_message: message of the review left by the customer, in Portuguese.
review_creation_date: date on which the satisfaction survey was sent to the customer.
review_answer_timestamp: timestamp of the satisfaction survey answer.

6.orders

The orders dataset contains the following features:

order_id: unique identifier of the order.
customer_id: key to the customers dataset. Each order has a unique customer_id.
order_status: order status (delivered, shipped, etc.).
order_purchase_timestamp: purchase timestamp.
order_approved_at: payment approval timestamp.
order_delivered_carrier_date: timestamp when the order was handed to the logistics partner.
order_delivered_customer_date: actual delivery date to the customer.
order_estimated_delivery_date: estimated delivery date shown to the customer at purchase time.

7.products

The products dataset contains the following features:

product_id: unique product identifier.
product_category_name: root category of the product, in Portuguese.
product_name_lenght: number of characters in the product name.
product_description_lenght: number of characters in the product description.
product_photos_qty: number of published product photos.
product_weight_g: product weight in grams.
product_length_cm: product length in centimeters.
product_height_cm: product height in centimeters.
product_width_cm: product width in centimeters.

8.sellers

The sellers dataset contains the following features:

seller_id: seller unique identifier.
seller_zip_code_prefix: zip code of the seller's location.
seller_city: seller city name.
seller_state: seller state.

9.product_category_name_translation

The category dataset contains the following features:

product_category_name: category name in Portuguese.
product_category_name_english: category name in English.

Overview of the datasets

####summary####
options(width = 100)
library(funModeling)
df_status(customer)
##                   variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1              customer_id       0       0    0    0     0     0 character  99441
## 2       customer_unique_id       0       0    0    0     0     0 character  96096
## 3 customer_zip_code_prefix       0       0    0    0     0     0 character  14994
## 4            customer_city       0       0    0    0     0     0 character   4119
## 5           customer_state       0       0    0    0     0     0 character     27
df_status(geolocation)
##                      variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1 geolocation_zip_code_prefix       0       0    0    0     0     0 character  19015
## 2             geolocation_lat       0       0    0    0     0     0   numeric 717363
## 3             geolocation_lng       0       0    0    0     0     0   numeric 717615
## 4            geolocation_city       0       0    0    0     0     0 character   8010
## 5           geolocation_state       0       0    0    0     0     0 character     27
df_status(order_items)
##              variable q_zeros p_zeros q_na p_na q_inf p_inf           type unique
## 1            order_id       0    0.00    0    0     0     0      character  98666
## 2       order_item_id       0    0.00    0    0     0     0        numeric     21
## 3          product_id       0    0.00    0    0     0     0      character  32951
## 4           seller_id       0    0.00    0    0     0     0      character   3095
## 5 shipping_limit_date       0    0.00    0    0     0     0 POSIXct/POSIXt  93318
## 6               price       0    0.00    0    0     0     0        numeric   5968
## 7       freight_value     383    0.34    0    0     0     0        numeric   6999
df_status(order_payments)
##               variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1             order_id       0    0.00    0    0     0     0 character  99440
## 2   payment_sequential       0    0.00    0    0     0     0   numeric     29
## 3         payment_type       0    0.00    0    0     0     0 character      5
## 4 payment_installments       2    0.00    0    0     0     0   numeric     24
## 5        payment_value       9    0.01    0    0     0     0   numeric  29077
df_status(sellers)
##                 variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1              seller_id       0       0    0    0     0     0 character   3095
## 2 seller_zip_code_prefix       0       0    0    0     0     0 character   2246
## 3            seller_city       0       0    0    0     0     0 character    611
## 4           seller_state       0       0    0    0     0     0 character     23
df_status(category)
##                        variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1         product_category_name       0       0    0    0     0     0 character     71
## 2 product_category_name_english       0       0    0    0     0     0 character     71
df_status(orders)
##                        variable q_zeros p_zeros q_na p_na q_inf p_inf           type unique
## 1                      order_id       0       0    0 0.00     0     0      character  99441
## 2                   customer_id       0       0    0 0.00     0     0      character  99441
## 3                  order_status       0       0    0 0.00     0     0      character      8
## 4      order_purchase_timestamp       0       0    0 0.00     0     0 POSIXct/POSIXt  98875
## 5             order_approved_at       0       0  160 0.16     0     0 POSIXct/POSIXt  90733
## 6  order_delivered_carrier_date       0       0 1783 1.79     0     0 POSIXct/POSIXt  81018
## 7 order_delivered_customer_date       0       0 2965 2.98     0     0 POSIXct/POSIXt  95664
## 8 order_estimated_delivery_date       0       0    0 0.00     0     0 POSIXct/POSIXt    459
df_status(order_reviews)
##                  variable q_zeros p_zeros  q_na  p_na q_inf p_inf           type unique
## 1               review_id       0    0.00     0  0.00     0     0      character  98410
## 2                order_id       0    0.00     0  0.00     0     0      character  98673
## 3            review_score       0    0.00     0  0.00     0     0        numeric      5
## 4    review_comment_title      13    0.01 87658 88.34     0     0      character   4178
## 5  review_comment_message       1    0.00 58256 58.71     0     0      character  35743
## 6    review_creation_date       0    0.00     0  0.00     0     0 POSIXct/POSIXt    636
## 7 review_answer_timestamp       0    0.00     0  0.00     0     0 POSIXct/POSIXt  98248
df_status(products)
##                     variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1                 product_id       0    0.00    0 0.00     0     0 character  32951
## 2      product_category_name       0    0.00  610 1.85     0     0 character     73
## 3        product_name_lenght       0    0.00  610 1.85     0     0   numeric     66
## 4 product_description_lenght       0    0.00  610 1.85     0     0   numeric   2960
## 5         product_photos_qty       0    0.00  610 1.85     0     0   numeric     19
## 6           product_weight_g       4    0.01    2 0.01     0     0   numeric   2204
## 7          product_length_cm       0    0.00    2 0.01     0     0   numeric     99
## 8          product_height_cm       0    0.00    2 0.01     0     0   numeric    102
## 9           product_width_cm       0    0.00    2 0.01     0     0   numeric     95

Missing value handling

Missing values come mainly from the orders, order_reviews, and products datasets


library(naniar)
orders %>% vis_miss()

order_reviews %>% vis_miss()

products %>% vis_miss()

variable_set=list(customer,geolocation,orders,order_items,order_payments,order_reviews,products,sellers,category)
n_colums <- map(variable_set,ncol) %>% unlist() 

# Count missing values per column; c() flattens the data frames into one
# list of columns, so the indices below run across all datasets in order
variable_set <- c(customer, geolocation, orders, order_items, order_payments, order_reviews, products, sellers, category)
missing_m <- variable_set%>% 
  map_int(~ sum(is.na(.))) %>% 
  as_tibble()

n <- which(missing_m>0)
c <- c()
n_rows <- variable_set %>% map(.,length)
n_rows <- n_rows [n] %>% unlist()
for (i in 1:length(n)) {
  if (n[i]<=n_colums[1]){
    print("customer")
    c <- c(c,"customer")
  }
  else if (n_colums[1]<n[i] && n[i]<=sum(n_colums[1:2])){
    print("geolocation")
    c <- c(c,"geolocation")
  }
  else if (sum(n_colums[1:2])<n[i] && n[i]<=sum(n_colums[1:3])){
    print("orders")
    c <- c(c,"orders")
  }
  else if (sum(n_colums[1:3])<n[i] && n[i]<=sum(n_colums[1:4])){
    print("order_items")
    c <- c(c,"order_items")
  }
  else if (sum(n_colums[1:4])<n[i] && n[i]<=sum(n_colums[1:5])){
    print("order_payments")
    c <- c(c,"order_payments")
  }
  else if (sum(n_colums[1:5])<n[i] && n[i]<=sum(n_colums[1:6])){
    print("order_reviews")
    c <- c(c,"order_reviews")
  }
  else if (sum(n_colums[1:6])<n[i] && n[i]<=sum(n_colums[1:7])){
    print("products")
    c <- c(c,"products")
  }
  else if (sum(n_colums[1:7])<n[i] && n[i]<=sum(n_colums[1:8])){
    print("sellers")
    c <- c(c,"sellers")
  }
  else if (sum(n_colums[1:8])<n[i] && n[i]<=sum(n_colums[1:9])){
    print("category")
    c <- c(c,"category")
  }
}
## [1] "orders"
## [1] "orders"
## [1] "orders"
## [1] "order_reviews"
## [1] "order_reviews"
## [1] "products"
## [1] "products"
## [1] "products"
## [1] "products"
## [1] "products"
## [1] "products"
## [1] "products"
## [1] "products"
# Collect the missing-value counts and row totals
c <- c %>%as_tibble() %>% count(value) %>% as.data.frame() 
missing_m <- variable_set%>% 
  map_int(~ sum(is.na(.))) %>% 
  .[which(.>0)]
n <- missing_m %>% 
  names() %>% as.data.frame()


missing_m <- missing_m %>% 
  as_tibble() %>% 
  cbind(n,.) %>% 
  as.data.frame() %>% 
  mutate(all_count=n_rows)
names(missing_m) <- c("variable","count","all_count")


missing_m <- missing_m %>% 
  mutate(label=c(rep("orders",3),rep("order_reviews",2),rep("products",8)),
         ratio=round(count/all_count, digits =3),
         percent=round((count/all_count)*100, digits =3))
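The index bookkeeping above (mapping flattened column positions back to dataset names) can be avoided by keeping the datasets in a named list. A sketch on toy data (the names and values are illustrative only):

```r
library(tidyverse)

# Two toy datasets with some missing values
toy <- list(
  orders  = tibble(id = 1:4, approved = c(1, NA, 3, NA)),
  reviews = tibble(id = 1:4, title = c(NA, NA, NA, "ok"))
)

# One row per column per dataset, labelled with the dataset name
missing_summary <- imap_dfr(toy, function(df, nm) {
  tibble(
    label     = nm,
    variable  = names(df),
    count     = map_int(df, ~ sum(is.na(.))),
    all_count = nrow(df)
  )
}) %>%
  filter(count > 0) %>%
  mutate(ratio = count / all_count)

missing_summary
```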

Missing value visualization

## Missing value visualization
library(colorspace)
library(patchwork) # enables p1 + p2
p1 <- missing_m %>%
  ggplot(aes(x = reorder(variable,count),y = count,color=label,fill=label)) +
  geom_col()+ # draw bars
  coord_flip()+ # flip the x and y axes
  xlab("variable")+
  theme(legend.position = "none")+
  ggtitle("The number of missing values")+ # set the title
  geom_text(aes(label = count, # label each bar with its count
                hjust =ifelse(count > 6000, 1.5, -0.3)),
            size =3.5)+
  scale_fill_discrete_sequential(palette = "Viridis", # bar colors
                                 nmax = 6, 
                                 rev = FALSE, 
                                 order = c(2,3,5),
                                 name="Data from:")+
  scale_color_discrete_sequential(palette = "Grays", # text colors
                                  nmax = 3, 
                                  rev = FALSE, 
                                  order = c(3,1,1),
                                  name="Data from:")
p2 <- missing_m %>%
  ggplot(aes(x = reorder(variable,ratio),
             y = ratio,
             color=label,
             fill=label)) +
  geom_col()+ 
  coord_flip()+
  xlab("variable")+
  theme(legend.position = c(0.75,0.15),
        axis.text.y = element_blank())+
  ggtitle("The ratio of missing value")+
  geom_text(aes(label = scales::percent(ratio), 
                hjust =ifelse(ratio > 0.5, 1.5, -0.3)),
            size =3.5)+
  scale_y_continuous(labels = scales::percent)+
  scale_fill_discrete_sequential(palette = "Viridis", 
                                 nmax = 6, 
                                 rev = FALSE, 
                                 order = c(2,3,5),
                                 name="Data from:")+
  scale_color_discrete_sequential(palette = "Grays",
                                  nmax = 3, 
                                  rev = FALSE, 
                                  order = c(3,1,1),
                                  name="Data from:")

p1+p2

Missing value deletion

1. In order_reviews, the review_comment_title and review_comment_message columns are each missing more than 50% of their values, so these columns are dropped outright.
2. The rows with missing values in orders and products amount to less than 5% of observations, so they are deleted directly.

##### Missing value deletion #####
library(mice)
library(VIM)
# 1. Drop columns from order_reviews
c1 <- as.character(missing_m$variable[missing_m$ratio > 0.5]) # columns missing > 50%
location1 <- grep(c1[1], names(order_reviews))
location2 <- grep(c1[2], names(order_reviews))
order_reviews1 <- order_reviews[, c(-location1, -location2)]
# 2. Drop incomplete rows from products and orders
# complete.cases() returns TRUE for a complete row and FALSE for a row
# with any missing value
products1 <- products[complete.cases(products), ]
orders1 <- orders[complete.cases(orders), ]

# 3. Confirm no missing values remain
orders1 %>% vis_miss()

order_reviews1 %>% vis_miss()

products1 %>% vis_miss()

orders1 %>% 
  filter(if_any(everything(), is.na))
## # A tibble: 0 × 8
## # ℹ 8 variables: order_id <chr>, customer_id <chr>, order_status <chr>,
## #   order_purchase_timestamp <dttm>, order_approved_at <dttm>, order_delivered_carrier_date <dttm>,
## #   order_delivered_customer_date <dttm>, order_estimated_delivery_date <dttm>
order_reviews1 %>% 
  filter(if_any(everything(), is.na))
## # A tibble: 0 × 5
## # ℹ 5 variables: review_id <chr>, order_id <chr>, review_score <dbl>, review_creation_date <dttm>,
## #   review_answer_timestamp <dttm>
products1 %>% 
  filter(if_any(everything(), is.na))
## # A tibble: 0 × 9
## # ℹ 9 variables: product_id <chr>, product_category_name <chr>, product_name_lenght <dbl>,
## #   product_description_lenght <dbl>, product_photos_qty <dbl>, product_weight_g <dbl>,
## #   product_length_cm <dbl>, product_height_cm <dbl>, product_width_cm <dbl>
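As a side note, the same row-wise deletion can be written with tidyr::drop_na(), which is equivalent to df[complete.cases(df), ] and fits the pipe style used here. Toy data:

```r
library(tidyverse)

toy <- tibble(x = c(1, NA, 3), y = c("a", "b", NA))

# drop_na() keeps only rows with no missing values; here only row 1 survives
toy %>% drop_na() %>% nrow()  # 1
```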

Adding derived features

##### Add date features for orders / order_items2 #####
library(lubridate) # attached by tidyverse >= 2.0; loaded explicitly for year(), wday(), etc.
# wday() returns the day-of-week index; by default Sunday is 1 and Monday is 2,
# while week_start = 1 makes Monday 1.

#orders
orders2 <- orders1 %>% 
  mutate(
    purchase_timestamp_year = year(order_purchase_timestamp),
    purchase_timestamp_month = month(order_purchase_timestamp),
    purchase_timestamp_day = mday(order_purchase_timestamp),
    purchase_timestamp_week = wday(order_purchase_timestamp, week_start = 1),
    purchase_timestamp_hour = hour(order_purchase_timestamp)
  )
orders02 <- orders %>% 
  mutate(
    purchase_timestamp_year = year(order_purchase_timestamp),
    purchase_timestamp_month = month(order_purchase_timestamp),
    purchase_timestamp_day = mday(order_purchase_timestamp),
    purchase_timestamp_week = wday(order_purchase_timestamp, week_start = 1),
    purchase_timestamp_hour = hour(order_purchase_timestamp)
  )
#order_items
order_items2 <- order_items %>% 
  mutate(
    shipping_limit_year = year(shipping_limit_date),
    shipping_limit_month = month(shipping_limit_date),
    shipping_limit_day = mday(shipping_limit_date),
    shipping_limit_week = wday(shipping_limit_date, week_start = 1),
    shipping_limit_hour = hour(shipping_limit_date)
  )
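The extraction above can be sanity-checked on a known date; 2017-10-02 was a Monday, so with week_start = 1 it maps to 1:

```r
library(lubridate)

ts <- as.POSIXct("2017-10-02 11:30:00", tz = "UTC")
year(ts)                  # 2017
month(ts)                 # 10
mday(ts)                  # 2
wday(ts, week_start = 1)  # 1 (Monday)
hour(ts)                  # 11
```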

Joining datasets

1. Join orders and customer

Join key: customer_id; result: orders_customer2

##### Join orders and customer on customer_id; result: orders_customer2 #####
# Joining from orders2 (which lost 2,980 observations to missing-value deletion)
# dropped data, so we join orders to customer first and delete incomplete
# observations afterwards (the missing rate is only about 0.4%)
n_distinct(orders$customer_id) == nrow(orders) # TRUE
## [1] TRUE
n_distinct(customer$customer_id) == nrow(customer) # TRUE
## [1] TRUE
orders_customer <- 
  left_join(orders, customer, by = "customer_id")
sum(is.na(orders_customer)) # 4908: the missing values come from orders
## [1] 4908
ptp <- orders_customer %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for orders_customer")

orders_customer2 <- orders_customer[complete.cases(orders_customer),]
sum(is.na(orders_customer2)) #sum=0
## [1] 0
nrow(orders_customer2)
## [1] 96461
orders_customer3 <- orders_customer2 %>% 
  mutate(
    purchase_timestamp_year = year(order_purchase_timestamp),
    purchase_timestamp_month = month(order_purchase_timestamp),
    purchase_timestamp_day = mday(order_purchase_timestamp),
    purchase_timestamp_week = wday(order_purchase_timestamp, week_start = 1),
    purchase_timestamp_hour = hour(order_purchase_timestamp)
  )

2. Join orders_customer2 and order_items2

Join key: order_id; result: orders_customer2_items2

##### Join orders_customer2 and order_items2 on order_id; result: orders_customer2_items2 #####
n_distinct(order_items2$order_id) == nrow(order_items2) 
## [1] FALSE
n_distinct(orders_customer2$order_id) == nrow(orders_customer2) 
## [1] TRUE
orders_customer2_items2 <- 
  left_join(orders_customer2, order_items2 , by = "order_id")
sum(is.na(orders_customer2_items2)) #sum=0
## [1] 0
nrow(orders_customer2_items2) 
## [1] 110180
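The row count grows from 96,461 to 110,180 because order_id is not unique in order_items2: a one-to-many left join repeats each order row once per item. A minimal illustration with toy keys:

```r
library(tidyverse)

orders_toy <- tibble(order_id = c("o1", "o2"))
items_toy  <- tibble(order_id = c("o1", "o1", "o2"), item = 1:3)

# o1 matches two items, so the joined table has three rows, not two
joined <- left_join(orders_toy, items_toy, by = "order_id")
nrow(joined)  # 3
```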

3. Join products and category

Join key: product_category_name; result: products_new1

##### Join products and category on product_category_name; results: products_new, products_new1 #####
n_distinct(category$product_category_name) == nrow(category) # TRUE
## [1] TRUE
n_distinct(products$product_category_name) == nrow(products) # FALSE
## [1] FALSE
products_new <- 
  left_join(products, category, by = "product_category_name")
sum(is.na(products_new)) #sum!=0
## [1] 3071
sum(is.na(products$product_category_name)) #sum!=0
## [1] 610
ptp <- products_new %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for products_new")

products_new1 <- products_new[complete.cases(products_new),]
sum(is.na(products_new1)) #sum=0
## [1] 0
nrow(products_new1 )
## [1] 32327

4. Join order_items2 and products_new

Join key: product_id; result: items2_products1

###### Join order_items2 and products_new on product_id; result: items2_products1 ######
n_distinct(order_items2$product_id) == nrow(order_items2) 
## [1] FALSE
n_distinct(products_new1$product_id) == nrow(products_new1) 
## [1] TRUE
items2_products <- 
  left_join(order_items2,products_new , by = "product_id")
sum(is.na(items2_products)) # sum != 0: products itself has missing values
## [1] 8111
ptp <- items2_products %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for items2_products")

items2_products1 <- items2_products[complete.cases(items2_products),]
sum(is.na(items2_products1)) #sum=0
## [1] 0
nrow(items2_products1)
## [1] 111022

5. Join order_items2 and orders

Join key: order_id; results: items2_orders, items2_orders2

##### Join order_items2 and orders on order_id; result: items2_orders #####
items2_orders<- left_join( order_items2,orders, by = "order_id") 
sum(is.na(items2_orders)) # 3663 missing values, inherited from orders
## [1] 3663
library(naniar) # missing-value package (already loaded above)
ptp <-items2_orders %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for order_items2_orders")

# keep complete observations only
items2_orders <- items2_orders[complete.cases(items2_orders), ]
sum(is.na(items2_orders)) #sum=0
## [1] 0
nrow(items2_orders)
## [1] 110180
items2_orders2 <- items2_orders %>% 
  mutate(
    purchase_timestamp_year = year(order_purchase_timestamp),
    purchase_timestamp_month = month(order_purchase_timestamp),
    purchase_timestamp_day = mday(order_purchase_timestamp),
    purchase_timestamp_week = wday(order_purchase_timestamp, week_start = 1),
    purchase_timestamp_hour = hour(order_purchase_timestamp),
    purchase_timestamp_month1 = format(order_purchase_timestamp, "%Y-%m")
  )

6. Join geolocation and customer

Join key: zip_code_prefix; result: customer_zip1

##### geolocation: prepare the data for the Brazil map (done) #####
names(customer)[3] <- "zip_code_prefix"
names(sellers)[2] <- "zip_code_prefix"
names(geolocation)[1] <- "zip_code_prefix"
# geolocation has many rows per zip code prefix, so collapse it to one row each
geolocation_group <- geolocation %>% group_by(zip_code_prefix) %>% 
  summarise(
    geolocation_lat = mean(geolocation_lat),
    geolocation_lng = mean(geolocation_lng),
    geolocation_city = names(which.max(table(geolocation_city))),   # most frequent city
    geolocation_state = names(which.max(table(geolocation_state)))  # most frequent state
  )
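One subtlety when collapsing duplicated zip codes to a single city: names(table(x))[1] returns the alphabetically first value, while the most frequent value needs which.max. A toy vector makes the difference visible:

```r
cities <- c("sao paulo", "sao paulo", "campinas")

names(table(cities))[1]          # "campinas"  (alphabetical order)
names(which.max(table(cities)))  # "sao paulo" (modal value)
```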
#customer_zip 
n_distinct(customer$zip_code_prefix) == nrow(customer) 
## [1] FALSE
n_distinct(geolocation_group$zip_code_prefix) == nrow(geolocation_group) 
## [1] TRUE
sum(is.na(customer))
## [1] 0
sum(is.na(geolocation_group))
## [1] 0
customer_zip <- left_join(customer, geolocation_group, by = "zip_code_prefix")
sum(is.na(customer_zip)) #sum!=0
## [1] 1112
ptp <- customer_zip %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for customer_zip")

customer_zip1 <- customer_zip[complete.cases(customer_zip),]

sum(is.na(customer_zip1)) #sum=0
## [1] 0
nrow(customer_zip1)
## [1] 99163

7. Join geolocation and sellers

Join key: zip_code_prefix; result: sellers_zip1

#sellers_zip
n_distinct(sellers$zip_code_prefix) == nrow(sellers) 
## [1] FALSE
n_distinct(geolocation_group$zip_code_prefix) == nrow(geolocation_group) 
## [1] TRUE
sum(is.na(sellers))
## [1] 0
sum(is.na(geolocation_group))
## [1] 0
sellers_zip <- left_join(sellers, geolocation_group, by = "zip_code_prefix")
sum(is.na(sellers_zip)) #sum!=0 
## [1] 28
ptp <- sellers_zip %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for sellers_zip")

sellers_zip1 <- sellers_zip[complete.cases(sellers_zip),]

sum(is.na(sellers_zip1)) #sum=0
## [1] 0
nrow(sellers_zip1)
## [1] 3088

8. Join orders and customer_zip

Join key: customer_id; result: orders_customer_distance

##### Join orders and customer_zip on customer_id; result: orders_customer_distance #####
# orders_customer_distance1 is the usable data
n_distinct(customer_zip$customer_id) == nrow(customer_zip) 
## [1] TRUE
n_distinct(orders02$customer_id) == nrow(orders02) 
## [1] TRUE
sum(is.na(customer_zip))
## [1] 1112
sum(is.na(orders02))
## [1] 4908
orders_customer_distance <- left_join(orders02, customer_zip, by = "customer_id")
sum(is.na(orders_customer_distance)) #sum!=0
## [1] 6020
ptp <- orders_customer_distance %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for orders_customer_distance")

orders_customer_distance1 <- orders_customer_distance[complete.cases(orders_customer_distance),]

sum(is.na(orders_customer_distance1)) #sum=0
## [1] 0
nrow(orders_customer_distance1)
## [1] 96197
names(orders_customer_distance)[15:21] <- c("customer_zip_code_prefix","customer_city","customer_state","customer_lat","customer_lng","customer_geolocation_city","customer_geolocation_state")

9. Join order_items2 and sellers_zip

Join key: seller_id; result: items2_sellers_distance

##### Join order_items2 and sellers_zip on seller_id; result: items2_sellers_distance #####
#sellers_zip
n_distinct(sellers_zip$seller_id) == nrow(sellers_zip) 
## [1] TRUE
n_distinct(order_items2$seller_id) == nrow(order_items2) 
## [1] FALSE
sum(is.na(sellers_zip))
## [1] 28
sum(is.na(order_items2))
## [1] 0
items2_sellers_distance <- left_join(order_items2, sellers_zip, by = "seller_id")
sum(is.na(items2_sellers_distance)) #sum!=0 
## [1] 1012
ptp <- items2_sellers_distance %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for items2_sellers_distance")

items2_sellers_distance1 <- items2_sellers_distance[complete.cases(items2_sellers_distance),]


sum(is.na(items2_sellers_distance1)) #sum=0
## [1] 0
nrow(items2_sellers_distance1)
## [1] 112397
names(items2_sellers_distance)[13:19] <- c("seller_zip_code_prefix","seller_city","seller_state","seller_lat","seller_lng","seller_geolocation_city","seller_geolocation_state")

10. Join orders_customer_distance and items2_sellers_distance

Join key: order_id; result: customer_sellers_zip

##### Join orders_customer_distance and items2_sellers_distance on order_id; result: customer_sellers_zip #####
n_distinct(orders_customer_distance$order_id) == nrow(orders_customer_distance) 
## [1] TRUE
n_distinct(items2_sellers_distance$order_id) == nrow(items2_sellers_distance) 
## [1] FALSE
sum(is.na(orders_customer_distance))
## [1] 6020
sum(is.na(items2_sellers_distance))
## [1] 1012
customer_sellers_zip <- left_join(orders_customer_distance, items2_sellers_distance, by = "order_id")
sum(is.na(customer_sellers_zip)) #sum!=0 
## [1] 21544
nrow(customer_sellers_zip)
## [1] 113425
ptp <- customer_sellers_zip %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for customer_sellers_zip")

customer_sellers_zip1 <- customer_sellers_zip[complete.cases(customer_sellers_zip),]

sum(is.na(customer_sellers_zip1)) #sum=0
## [1] 0
nrow(customer_sellers_zip1)
## [1] 109644
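With customer and seller coordinates now in one table, a shipping distance per order could be derived. A base-R haversine sketch (the function name haversine_km is ours; the toy coordinates are São Paulo and Rio de Janeiro, roughly 360 km apart):

```r
# Great-circle distance in kilometers between two lat/lng points
haversine_km <- function(lat1, lng1, lat2, lng2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * r * asin(sqrt(a))
}

haversine_km(-23.5505, -46.6333, -22.9068, -43.1729)  # ~361 km
```

Applied to customer_sellers_zip1 this would look like mutate(distance_km = haversine_km(customer_lat, customer_lng, seller_lat, seller_lng)), assuming the renamed coordinate columns above.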

Exploratory data analysis

1. Items purchased per order

##### order_id_cost #####
order_id_cost <- order_items2 %>% 
  group_by(order_id) %>% 
  summarise(
    items_buy = n(), # number of items in the order
    n_distinct_shipping_limit_day = n_distinct(shipping_limit_date), # distinct seller shipping deadlines
    n_distinct_product = n_distinct(product_id), # distinct products in the order
    product_total_cost = sum(price), # total item cost
    Fright_total_cost = sum(freight_value), # total freight cost
    Total_cost = sum(price)+sum(freight_value), # total spend
    max_cost = max(price), # price of the most expensive item
    min_cost = min(price), # price of the cheapest item
    avg_Fright = round(mean(freight_value),digits = 2), # average freight
    avg_cost_nproduct = round(Total_cost/items_buy,digits = 2) # average cost per item (freight included)
  )
# Overview
summary(order_id_cost)
##    order_id           items_buy      n_distinct_shipping_limit_day n_distinct_product
##  Length:98666       Min.   : 1.000   Min.   :1.000                 Min.   :1.000     
##  Class :character   1st Qu.: 1.000   1st Qu.:1.000                 1st Qu.:1.000     
##  Mode  :character   Median : 1.000   Median :1.000                 Median :1.000     
##                     Mean   : 1.142   Mean   :1.004                 Mean   :1.038     
##                     3rd Qu.: 1.000   3rd Qu.:1.000                 3rd Qu.:1.000     
##                     Max.   :21.000   Max.   :3.000                 Max.   :8.000     
##  product_total_cost Fright_total_cost   Total_cost          max_cost          min_cost      
##  Min.   :    0.85   Min.   :   0.00   Min.   :    9.59   Min.   :   0.85   Min.   :   0.85  
##  1st Qu.:   45.90   1st Qu.:  13.85   1st Qu.:   61.98   1st Qu.:  42.20   1st Qu.:  40.00  
##  Median :   86.90   Median :  17.17   Median :  105.29   Median :  79.90   Median :  79.00  
##  Mean   :  137.75   Mean   :  22.82   Mean   :  160.58   Mean   : 126.63   Mean   : 125.29  
##  3rd Qu.:  149.90   3rd Qu.:  24.04   3rd Qu.:  176.87   3rd Qu.: 139.90   3rd Qu.: 139.84  
##  Max.   :13440.00   Max.   :1794.96   Max.   :13664.08   Max.   :6735.00   Max.   :6735.00  
##    avg_Fright     avg_cost_nproduct
##  Min.   :  0.00   Min.   :   9.34  
##  1st Qu.: 13.37   1st Qu.:  57.51  
##  Median : 16.36   Median :  96.22  
##  Mean   : 20.19   Mean   : 146.11  
##  3rd Qu.: 21.18   3rd Qu.: 163.03  
##  Max.   :409.68   Max.   :6929.31

Items per order

# how many items each order contains
order_id_cost %>% ggplot(aes(x = items_buy))+
  geom_bar(fill="#006699",alpha = 0.7)+
  xlab("number of items purchased")+
  ggtitle("Total items purchased per transaction")+
  geom_text(stat = "count", 
            aes(label = after_stat(count)),
            size = 3,vjust = -0.2)+
  scale_x_continuous(breaks = seq(0, max(order_id_cost$items_buy), by = 1))

Spending by number of items per order

# total spend for each purchase quantity
order_id_cost %>% ggplot(aes(x =as.factor(items_buy) ,y = product_total_cost))+
  geom_boxplot()+
  xlab("number of items purchased")+
  ggtitle("Total cost by number of items purchased")

9,803 orders (about 9.94%) contain two or more items.

order_id_cost[(which(order_id_cost$items_buy>=2)),] %>% count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  9803
cat("ratio:", round(sum(order_id_cost$items_buy >= 2)/nrow(order_id_cost)*100, digits = 4), "%")
## ratio: 9.9355 %

2. Purchases per customer

##### customer_id_cost #####
customer_id_cost <- orders_customer2_items2 %>% 
  group_by(customer_unique_id,order_id) %>% 
  summarise(
    items_buy = n(), # number of items in the order
    n_distinct_shipping_limit_day = n_distinct(shipping_limit_date), # distinct seller shipping deadlines
    n_distinct_product = n_distinct(product_id), # distinct products in the order
    product_total_cost = sum(price), # total item cost
    Fright_total_cost = sum(freight_value), # total freight cost
    Total_cost = sum(price)+sum(freight_value), # total spend
    max_cost = max(price), # price of the most expensive item
    min_cost = min(price), # price of the cheapest item
    avg_Fright = round(mean(freight_value),digits = 2), # average freight
    avg_cost_nproduct = round(Total_cost/items_buy,digits = 2) # average cost per item (freight included)
  )

Repurchase frequency per customer

87.6% of customers bought only once; about 9.7% of customers made a repeat purchase.

## number of purchases per customer
customer_id_n <- orders_customer2_items2 %>% 
  count(customer_unique_id, sort = TRUE) %>% 
  count(n, sort = TRUE) %>% 
  mutate(ratio = round(nn/sum(nn), digits =3))
  

customer_id_n %>%
  mutate(label1 = str_c(scales::percent(ratio, accuracy=0.1)," ", "(",nn,")")) %>% 
  ggplot(aes(x = reorder(n,nn),
             y = ratio))+
  geom_col(alpha = 0.7)+
  geom_col(data = customer_id_n %>%  filter(n=="1"),fill="#E69F00")+
  xlab("Repurchase frequency")+
  ylab("Ratio of people")+
  coord_flip()+
  ggtitle("The repurchase frequency per customer in the store")+
  geom_text(aes(label = label1, 
                hjust =ifelse(nn > 9000, 1.2, -0.3)),
            size =3)+
  scale_y_continuous(labels = scales::percent)

Time to second purchase

second_purchase <- orders_customer2_items2 %>% 
  group_by(customer_unique_id) %>% 
  summarise(
    purchase_time = n(), # number of purchases
    first_buy = min(floor_date(order_purchase_timestamp,"day")), # earliest purchase date
    recent =  sort(floor_date(order_purchase_timestamp,"day"), decreasing = F)[2], # second purchase date
    since = recent-first_buy # time to second purchase
  )


second_purchase1 <- second_purchase %>% filter(since>0) %>% 
  mutate(
    days =round(as.double(since, units="days"),digits =1)
  )
cat("mean:",mean(second_purchase1$days),"days \n")
## mean: 116.7948 days
cat("variance:",var(second_purchase1$days),"days^2")
## variance: 13321.11 days^2

The mean repurchase interval is about 116 days, while the density peaks around 24 days.

purchase_label <- second_purchase1 %>% count(days)
density_est <- density(second_purchase1$days)
peak_days <- density_est$x[which.max(density_est$y)]

second_purchase1 %>% ggplot(aes(x = days))+
  geom_density()+
  xlab("Days")+
  ggtitle("Time interval to the second purchase")+
  geom_vline(aes(xintercept = mean(days)), color = "red", linetype = "dashed", linewidth = 0.8)+
  geom_vline(aes(xintercept = peak_days), color = "blue", linetype = "dashed", linewidth = 0.8) +
  annotate("text", x = mean(second_purchase1$days), y = 0, label = paste("Mean:", round(mean(second_purchase1$days), 2)), vjust = -5,hjust = -0.05, color = "red") +
  annotate("text", x = peak_days, y = 0, label = paste("Peak:", round(peak_days, 2)), vjust = -10, hjust = -0.03,color = "blue")+
  theme(
    plot.title = element_text(size = 15),
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 10) )

3. Sales overview

Monthly revenue

items2_orders2_year <- items2_orders2 %>% 
  group_by(purchase_timestamp_year,purchase_timestamp_month) %>%
  summarise(
    orders=n(),
    revenue=sum(price)
  )

p1 <- items2_orders2_year %>% 
  filter(purchase_timestamp_year==2016) %>% 
  ggplot(aes(x =as.factor(purchase_timestamp_month),y = revenue))+
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
  xlab("month")+
  ylim(0,1000000)+
  ggtitle("2016 monthly revenue")

p2 <- items2_orders2_year %>% 
  filter(purchase_timestamp_year==2017) %>% 
  ggplot(aes(x =as.factor(purchase_timestamp_month),y = revenue))+
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
  xlab("month")+
  ggtitle("2017 monthly revenue")


p3 <- items2_orders2_year %>% 
  filter(purchase_timestamp_year==2018) %>% 
  ggplot(aes(x =as.factor(purchase_timestamp_month),y = revenue))+
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
  xlab("month")+
  ggtitle("2018 monthly revenue")

p1+p2+p3

Monthly order volume

#order volume
p1 <- items2_orders2_year %>% 
  filter(purchase_timestamp_year==2016) %>% 
  ggplot(aes(x =as.factor(purchase_timestamp_month),y = orders))+
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
  xlab("month")+
  ylim(0,9000)+
  ggtitle("2016 monthly orders")

p2 <- items2_orders2_year %>% 
  filter(purchase_timestamp_year==2017) %>% 
  ggplot(aes(x =as.factor(purchase_timestamp_month),y = orders))+
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
  xlab("month")+
  ylim(0,9000)+
  ggtitle("2017 monthly orders")

p3 <- items2_orders2_year %>% 
  filter(purchase_timestamp_year==2018) %>% 
  ggplot(aes(x =as.factor(purchase_timestamp_month),y = orders))+
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
  xlab("month")+
  ylim(0,9000)+
  ggtitle("2018 monthly orders")

p1+p2+p3

Revenue and order volume grew year over year and then leveled off, with a pronounced peak in November 2017.

November revenue and order volume

Looking at November 2017 alone, revenue and order volume on 11/24 are striking: it was the Black Friday promotion, when sales peaked far above ordinary days.

items2_orders2_11<- items2_orders2 %>% 
  filter(purchase_timestamp_year==2017,purchase_timestamp_month==11) %>%
  group_by(purchase_timestamp_day) %>%
  summarise(
    orders=n(),
    revenue=sum(price)
  )
p1 <- items2_orders2_11 %>% 
  ggplot(aes(x =as.factor(purchase_timestamp_day),y = revenue))+
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
  xlab("day")+
  ggtitle("2017.11 revenue")

p2 <- items2_orders2_11 %>% 
  ggplot(aes(x =as.factor(purchase_timestamp_day),y = orders))+
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
  xlab("day")+
  ggtitle("2017.11 orders")
p1

p2

Order volume by day of week

Daily order counts average about 157.

daily_orders <- orders2 %>%
  group_by(purchase_timestamp_year,purchase_timestamp_month,
           purchase_timestamp_day,purchase_timestamp_week) %>%
  summarise(orders = n())

month_orders <- orders2 %>%
  group_by(purchase_timestamp_year,purchase_timestamp_month) %>%
  summarise(orders = n())

summary(daily_orders)
##  purchase_timestamp_year purchase_timestamp_month purchase_timestamp_day purchase_timestamp_week
##  Min.   :2016            Min.   : 1.000           Min.   : 1.00          Min.   :1.000          
##  1st Qu.:2017            1st Qu.: 3.000           1st Qu.: 8.00          1st Qu.:2.000          
##  Median :2017            Median : 6.000           Median :16.00          Median :4.000          
##  Mean   :2017            Mean   : 5.822           Mean   :15.64          Mean   :3.997          
##  3rd Qu.:2018            3rd Qu.: 8.000           3rd Qu.:23.00          3rd Qu.:6.000          
##  Max.   :2018            Max.   :12.000           Max.   :31.00          Max.   :7.000          
##      orders       
##  Min.   :   1.00  
##  1st Qu.:  97.75  
##  Median : 145.50  
##  Mean   : 157.62  
##  3rd Qu.: 213.00  
##  Max.   :1147.00

Boxplot:
1. Daily order counts on each day of the week mostly fall between 0 and 300.
2. Friday, Saturday, and Sunday show scattered outliers, with larger fluctuations in order counts.

daily_orders %>% 
  ggplot(aes(x = as.factor(purchase_timestamp_week), 
             y = orders,group =purchase_timestamp_week)) +
  geom_boxplot() +
  stat_boxplot(geom = "errorbar", width = 0.3)+
  ggtitle("Order volume by day of week")+
  xlab("day of week") +
  ylab("orders") +
  geom_point(position=position_jitter(width = .15, height = 0, seed = 320), size = 0.75) +
  theme_classic(base_size = 19)

Violin (density) plots:
1. Monday and Tuesday order counts are spread fairly evenly.
2. Wednesday, Thursday, Saturday, and Sunday show two modes.
3. Friday order counts are relatively concentrated.

daily_orders %>% 
  ggplot(aes(x = as.factor(purchase_timestamp_week), 
             y = orders,group =purchase_timestamp_week)) +
  geom_violin(color = "transparent", fill = "gray90") +
  ggtitle("Order volume by day of week")+
  xlab("day of week") +
  ylab("orders") +
  geom_point(position=position_jitter(width = .15, height = 0, seed = 320), size = 0.75) +
  theme_classic(base_size = 19)

4. Product category observations

#improved: add English product category names
#total spend per order
items2_products_cost <- items2_products1 %>%
  group_by(order_id) %>%
  summarise(
    items_buy = n(), # total items purchased
    n_distinct_shipping_limit_day = n_distinct(shipping_limit_date), # distinct seller shipping-limit dates in the order
    n_distinct_product = n_distinct(product_id), # distinct products in the same order
    n_distinct_seller = n_distinct(seller_id), # distinct sellers in the same order
    product_total_cost = sum(price), # total product spend
    product_category = ifelse(items_buy==1,product_category_name_english,"2+"),
    Fright_total_cost = sum(freight_value), # total freight cost
    Total_cost = sum(price)+sum(freight_value), # total spend
    max_cost = max(price), # highest item price in the order
    min_cost = min(price), # lowest item price in the order
    avg_Fright = round(mean(freight_value),digits = 2), # average freight
    avg_cost_nproduct = round(Total_cost/items_buy,digits = 2) # average spend per item (freight included)
  )
######product categories######
# compute sales by product category and sort by revenue
top_10_categories <- items2_products %>%
  group_by(product_category_name_english) %>%
  summarise(
    total_price = sum(price),
    total_freight = sum(freight_value)
    ) %>%
  top_n(10, total_price) %>%
  arrange(desc(total_price)) %>%
  pull(product_category_name_english)

# keep only rows from the top-10 categories
items2_products_top10 <- items2_products %>%
  filter(product_category_name_english %in% top_10_categories)
# violin plots of item prices for the top-10 categories
ggplot(items2_products_top10, aes(x = product_category_name_english, y = price)) +
  geom_violin() +
  xlab("Product Category") +
  ylab("Price") +
  ggtitle("Item Price Distribution for Top 10 Categories") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

##fill NAs in product_category_name_english with "None"
items2_products$product_category_name_english <- 
  replace(items2_products$product_category_name_english, 
          is.na(items2_products$product_category_name_english), "None")
sum(is.na(items2_products$product_category_name_english))
## [1] 0
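The same NA fill can be written with `tidyr::replace_na()`, which reads more idiomatically inside a pipe. A sketch, assuming tidyr is available via the tidyverse loaded at the top:

```r
# Equivalent NA handling with tidyr (same result as the replace() call above)
items2_products <- items2_products %>%
  mutate(product_category_name_english =
           replace_na(product_category_name_english, "None"))
```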

Most popular product categories

1. bed_bath_table (9.9%)
2. health_beauty (8.6%)
3. sports_leisure (7.7%)
4. furniture_decor (7.4%)
5. computers_accessories (6.9%)
6. housewares (6.2%)
7. watches_gifts (5.3%)

#popular categories (n > 700)
items2_products %>% count(product_category_name_english,sort = T) %>% 
  mutate(ratio = round(n/sum(n), digits =3),
         label1 = str_c(scales::percent(ratio, accuracy=0.1)," ", "(",n,")")
  ) %>%
  filter(n>700) %>%
  ggplot(aes(x = reorder(product_category_name_english,n),
             y = ratio,
             fill = product_category_name_english))+
  geom_col(alpha = 0.7,
           show.legend = FALSE)+
  xlab("product_category_name")+
  coord_flip()+ # flip X and Y axes
  ggtitle("The ratio of Popular Products")+
  geom_text(aes(label = label1,
                hjust =ifelse(n > 9000, 1.1, -0.1)),
             size =3)+
  scale_y_continuous(labels = scales::percent)

Least popular product categories

1. security_and_services (0%)
2. fashion_childrens_clothes (0%)
3. cds_dvds_musicals (0%)
4. la_cuisine (0%)
5. arts_and_craftmanship (0%)
6. fashion_sport (0%)
7. home_comfort_2 (0%)

#least popular categories (n < 100)
items2_products %>% count(product_category_name_english,sort = T) %>% 
  mutate(ratio = round(n/sum(n), digits =3),
         label1 = str_c(scales::percent(ratio, accuracy=0.1)," ", "(",n,")")
  ) %>% 
  filter(n<100) %>%
  ggplot(aes(x = reorder(product_category_name_english,n),
             y = ratio,
             fill = product_category_name_english))+
  geom_col(alpha = 0.7,
           show.legend = FALSE)+
  xlab("product_category_name")+
  coord_flip()+ # flip X and Y axes
  ggtitle("The ratio of Niche Products")+
  geom_text(aes(label = label1,
                hjust =ifelse(n > 50, 1.1, -0.1)), 
            size =3)+
  scale_y_continuous(labels = scales::percent)

There are 72 product categories in total.

unique(items2_products$product_category_name_english)
##  [1] "cool_stuff"                              "pet_shop"                               
##  [3] "furniture_decor"                         "perfumery"                              
##  [5] "garden_tools"                            "housewares"                             
##  [7] "telephony"                               "health_beauty"                          
##  [9] "books_technical"                         "fashion_bags_accessories"               
## [11] "bed_bath_table"                          "sports_leisure"                         
## [13] "consoles_games"                          "office_furniture"                       
## [15] "luggage_accessories"                     "food"                                   
## [17] "agro_industry_and_commerce"              "electronics"                            
## [19] "computers_accessories"                   "construction_tools_construction"        
## [21] "audio"                                   "baby"                                   
## [23] "construction_tools_lights"               "toys"                                   
## [25] "stationery"                              "industry_commerce_and_business"         
## [27] "watches_gifts"                           "auto"                                   
## [29] "None"                                    "home_appliances"                        
## [31] "kitchen_dining_laundry_garden_furniture" "air_conditioning"                       
## [33] "home_confort"                            "fixed_telephony"                        
## [35] "small_appliances_home_oven_and_coffee"   "diapers_and_hygiene"                    
## [37] "signaling_and_security"                  "musical_instruments"                    
## [39] "small_appliances"                        "costruction_tools_garden"               
## [41] "art"                                     "home_construction"                      
## [43] "books_general_interest"                  "party_supplies"                         
## [45] "construction_tools_safety"               "cine_photo"                             
## [47] "fashion_underwear_beach"                 "fashion_male_clothing"                  
## [49] "food_drink"                              "drinks"                                 
## [51] "furniture_living_room"                   "market_place"                           
## [53] "music"                                   "fashion_shoes"                          
## [55] "flowers"                                 "home_appliances_2"                      
## [57] "fashio_female_clothing"                  "computers"                              
## [59] "books_imported"                          "christmas_supplies"                     
## [61] "furniture_bedroom"                       "home_comfort_2"                         
## [63] "dvds_blu_ray"                            "cds_dvds_musicals"                      
## [65] "arts_and_craftmanship"                   "furniture_mattress_and_upholstery"      
## [67] "tablets_printing_image"                  "costruction_tools_tools"                
## [69] "fashion_sport"                           "la_cuisine"                             
## [71] "security_and_services"                   "fashion_childrens_clothes"

5. Order payment methods

Most orders are paid by credit card; the second most common payment method is the boleto (a Brazilian bank payment slip).

payment_type_n <- order_payments %>% count(payment_type) %>% 
  mutate(ratio = round(n/sum(n),digits = 4))

payment_type_n %>%
  mutate(label1 = str_c(scales::percent(ratio)," ", "(",n,")")) %>% 
  ggplot(aes(x = reorder(payment_type,-ratio),
             y = ratio))+
  geom_col(fill="#006699",alpha = 0.7)+
  xlab("payment type")+
  ggtitle("The ratio of payment type")+
  geom_text(aes(label = label1, 
                vjust = -0.3 ),
            size =3.5)+
  scale_y_continuous(labels = scales::percent)

  1. About 50% of credit-card payers choose a single installment; the mean is about 2.85 installments.
  2. Most customers choose few installments; notably, about 5% choose as many as 10 installments.
#number of installments chosen by the customer
library("RColorBrewer") #display.brewer.pal()
library(colorspace)    
payment_installments_n <- order_payments %>% count(payment_installments) %>% 
  mutate(ratio = round(n/sum(n),digits = 3))

payment_installments_n %>%
  mutate(label1 = str_c(scales::percent(ratio, accuracy=0.1)," ", "(",n,")")) %>% 
  ggplot(aes(x = reorder(payment_installments,ratio),
             y = ratio,
             fill=as.factor(payment_installments)))+
  geom_col(alpha = 0.7)+
  xlab("payment installments")+
  coord_flip()+
  ggtitle("Percentage of customers choosing installment terms")+
  geom_text(aes(label = label1, 
                hjust =ifelse(n > 50000, 1.3, -0.2)
                ),
            color = ifelse(payment_installments_n$n > 50000, "white","black"),
            size =3.5)+
  scale_y_continuous(labels = scales::percent)+
  scale_fill_discrete_sequential(palette = "Blues 3", 
                                 nmax = length(payment_installments_n$payment_installments),
                                 rev = FALSE,
                                 name = "payment installments")

##payment_value 
summary(order_payments)
##    order_id         payment_sequential payment_type       payment_installments payment_value     
##  Length:103886      Min.   : 1.000     Length:103886      Min.   : 0.000       Min.   :    0.00  
##  Class :character   1st Qu.: 1.000     Class :character   1st Qu.: 1.000       1st Qu.:   56.79  
##  Mode  :character   Median : 1.000     Mode  :character   Median : 1.000       Median :  100.00  
##                     Mean   : 1.093                        Mean   : 2.853       Mean   :  154.10  
##                     3rd Qu.: 1.000                        3rd Qu.: 4.000       3rd Qu.:  171.84  
##                     Max.   :29.000                        Max.   :24.000       Max.   :13664.08

6. Day of week and order hour

Heatmap of day of week and order hour

The heatmap of day of week against order hour shows that orders peak between 10:00 and 23:00 on Monday through Friday, with another wave of orders on Sunday between 18:00 and 23:00.

#####heatmap#####
library(pheatmap)
#heatmap of day of week vs. order hour
heatmap_data <- table(orders2$purchase_timestamp_week, orders2$purchase_timestamp_hour)
pheatmap(heatmap_data,
         cluster_rows = F,  # no row clustering
         cluster_cols = F,  # no column clustering
         scale = "none",    # scaling: 'none', 'row' or 'column'
         color = rev(
           colorspace::sequential_hcl(
             n = n_distinct(orders2$purchase_timestamp_hour), 
             palette = "Purples 3") ),
         treeheight_col = 10, # dendrogram height (unused when clustering is off)
         treeheight_row =10, # dendrogram height (unused when clustering is off)
         legend =T,
         main = "2016.9 ~2018.8 Heatmap between weeks and hours",  # title
         display_numbers=T, # show the count in each cell
         number_color = "black",
         number_format = "%.0f", # no decimal places on cell numbers
         fontsize   =10, # title size
         fontsize_number = 8, # cell number size
         angle_col = 0  # column label angle
         )

Order patterns by year:

  1. 2016 has too little data to comment on; the 2017 and 2018 ordering patterns are very similar.

  2. Conjecture: order volume may depend on both the hour and the day of week.

heatmap_plot <- function(heatmap_data, main_title){
  pheatmap(heatmap_data,
           cluster_rows = F,  # no row clustering
           cluster_cols = F,  # no column clustering
           scale = "none",    # scaling: 'none', 'row' or 'column'
           color = rev(
             colorspace::sequential_hcl(
               n = n_distinct(orders2$purchase_timestamp_hour), 
               palette = "Purples 3") ),
           legend   =T,
           main = main_title, # title
           display_numbers=T,
           number_color ="black",
           number_format = "%.0f",
           fontsize =10, 
           fontsize_number = 8,
           angle_col = 0  # column label angle
  )
}
orders2_2016 <- orders2 %>% filter(purchase_timestamp_year=="2016")
heatmap_data <- table(orders2_2016$purchase_timestamp_week, orders2_2016$purchase_timestamp_hour)
heatmap_plot(heatmap_data,main = "2016 Heatmap between weeks and hours")

orders2_2017 <- orders2 %>% filter(purchase_timestamp_year=="2017")
heatmap_data <- table(orders2_2017$purchase_timestamp_week, orders2_2017$purchase_timestamp_hour)
heatmap_plot(heatmap_data,main = "2017 Heatmap between weeks and hours")

orders2_2018 <- orders2 %>% filter(purchase_timestamp_year=="2018")
heatmap_data <- table(orders2_2018$purchase_timestamp_week, orders2_2018$purchase_timestamp_hour)
heatmap_plot(heatmap_data,main = "2018 Heatmap between weeks and hours")

Chi-squared test of day of week and order hour

Test of independence (chi-squared test): is order volume significantly associated with day of week and hour?

\(H_0\) : order volume is not significantly associated with day of week and hour \(\quad\) vs \(\quad\) \(H_1\) : order volume is significantly associated with day of week and hour

Inference: p-value < 0.05, so order volume is significantly associated with day of week and hour.

Purchase behavior across hours of the day differs significantly between days of the week.

heatmap_data <- table(orders2$purchase_timestamp_week, orders2$purchase_timestamp_hour)
chisq.test(heatmap_data)
## 
##  Pearson's Chi-squared test
## 
## data:  heatmap_data
## X-squared = 1138.5, df = 138, p-value < 2.2e-16
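To see which weekday-hour cells drive the significant result, the standardized residuals of the chi-squared test can be inspected. A sketch; cells with |residual| > 2 deviate notably from what independence would predict:

```r
# Standardized residuals: large positive values mark weekday-hour cells
# with more orders than expected under independence.
chisq_res <- chisq.test(heatmap_data)
which(abs(chisq_res$stdres) > 2, arr.ind = TRUE)
```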

7. Geographic information

Customer distribution by state

  1. 42% of customers come from SP (São Paulo), 12.9% from RJ (Rio de Janeiro), and 11.7% from MG (Minas Gerais).

  2. Together these three states account for more than half of all customers (66.6%).

customer_state_n <- customer %>% count(customer_state, sort = TRUE) %>% 
  mutate(ratio = round(n/sum(n), digits =3))

customer_state_n %>%
  mutate(label1 = str_c(scales::percent(ratio, accuracy=0.1)," ", "(",n,")")) %>% 
  ggplot(aes(x = reorder(customer_state,n),
             y = ratio))+
  geom_col(alpha = 0.7)+
  geom_col(data = customer_state_n %>%  filter( (customer_state=="SP") | (customer_state=="RJ") | (customer_state=="MG")),fill="#E69F00")+
  xlab("customer state")+
  coord_flip()+
  ggtitle("The ratio of customer state")+
  geom_text(aes(label = label1,
                hjust =ifelse(n > 40000, 1.2, -0.3)),
            size =3)+
  scale_y_continuous(labels = scales::percent)

Seller distribution by state

  1. 59.7% of sellers come from SP (São Paulo), 11.3% from PR (Paraná), and 7.9% from MG (Minas Gerais).

  2. Together these three states account for more than half of all sellers (78.9%).

#sellers_state barplot: state distribution
sellers_state_n <- sellers %>% count(seller_state, sort = TRUE) %>% 
  mutate(ratio = round(n/sum(n), digits =3))

sellers_state_n %>%
  mutate(label1 = str_c(scales::percent(ratio, accuracy=0.1)," ", "(",n,")")) %>% 
  ggplot(aes(x = reorder(seller_state,n),
             y = ratio))+
  geom_col(alpha = 0.7)+
  geom_col(data = sellers_state_n %>%  filter( (seller_state=="SP") | (seller_state=="PR") | (seller_state=="MG")),fill="#E69F00")+
  xlab("sellers state")+
  coord_flip()+
  ggtitle("The ratio of sellers state")+
  geom_text(aes(label = label1,
                hjust =ifelse(n > 500, 1.2, -0.3)),
            size =3)+
  scale_y_continuous(labels = scales::percent)

#draw the map of Brazil
library(maps) # world maps
library(rnaturalearth)
library(rnaturalearthhires) # ne_states
library(sf) # simple features
brazil <- ne_countries(country = "Brazil", returnclass = "sf")
brazil_states <- ne_states(country = "Brazil", returnclass = "sf")

# convert to sf objects
customer_sf <- st_as_sf(customer_zip1, 
                        coords = c("geolocation_lng", "geolocation_lat"), 
                        crs = 4326)
sellers_sf <- st_as_sf(sellers_zip1 , 
                       coords = c("geolocation_lng", "geolocation_lat"), 
                       crs = 4326)

Customer coordinates

##points only, no basemap
ggplot(customer_zip1, 
       aes(geolocation_lng, geolocation_lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()

Customer and seller coordinates

#customer population
g1 <- ggplot() +
  geom_sf(data = brazil_states)+
  geom_sf(data = customer_sf, color = "red", size = 1) + # plot points on the map
  geom_sf_text(data = brazil_states, aes(label = name), size = 3) + # label state names
  labs(title = "customer population in Brazil") + 
  xlab("Longitude") + ylab("Latitude") +
  theme(legend.position = "none")
#sellers population
g2 <- ggplot() +
  geom_sf(data = brazil_states)+
  geom_sf(data = sellers_sf, color = "red", size = 1) + 
  geom_sf_text(data = brazil_states, aes(label = name), size = 3) + 
  labs(title = "sellers population in Brazil") +  
  xlab("Longitude") + ylab("Latitude") +
  theme(legend.position = "none")
g1+g2

Geographic distribution of customers and sellers by state

  1. The maps show that both customers and sellers are concentrated around SP (São Paulo) and spread into the neighboring states.

  2. No sellers come from RR (Roraima), AP (Amapá), TO (Tocantins), or AL (Alagoas).

cn <- customer_zip1 %>% group_by(customer_state) %>% 
  summarise(
    count = n()
  )

bn <- sellers_zip1 %>% group_by(seller_state) %>% 
  summarise(
    count = n()
  )
#customer population
#way2
brazil_states$iso_3166_2 <- str_remove(brazil_states$iso_3166_2, "BR-")
names(cn)[1] <- "iso_3166_2"
names(bn)[1] <- "iso_3166_2"
brazil_states1 <- 
  left_join(brazil_states, cn, by = "iso_3166_2")

brazil_states2 <- 
  left_join(brazil_states, bn, by = "iso_3166_2")

ggplot(customer_sf) + 
  geom_sf(data = brazil_states1,aes(fill = count))+
  coord_sf()+
  geom_sf_text(data = brazil_states1, aes(label = iso_3166_2), size = 3)+
  scale_fill_continuous_sequential(palette = "Reds3", rev = T)+
  labs(title = "Customer population in Brazil") +
  xlab("Longitude") + ylab("Latitude")
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not give correct results
## for longitude/latitude data

ggplot(sellers_sf) + 
  geom_sf(data = brazil_states2,aes(fill = count))+
  coord_sf()+
  geom_sf_text(data = brazil_states2, aes(label = iso_3166_2), size = 3)+
  scale_fill_continuous_sequential(palette = "Pink Yl", rev = T)+
                                   #limits = c(0, 40000))+
  labs(title = "Sellers population in Brazil") + 
  xlab("Longitude") + ylab("Latitude")
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not give correct results
## for longitude/latitude data

Building a regression model

##density of daily order counts

##### daily_orders #####
daily_orders <- orders2 %>%
  group_by(purchase_timestamp_year,purchase_timestamp_month,
           purchase_timestamp_day,purchase_timestamp_week) %>%
  summarise(orders = n())

month_orders <- orders2 %>%
  group_by(purchase_timestamp_year,purchase_timestamp_month) %>%
  summarise(orders = n())

hour_orders <- orders2 %>%
  group_by(purchase_timestamp_year,purchase_timestamp_month,
           purchase_timestamp_day,purchase_timestamp_week,
           purchase_timestamp_hour) %>%
  summarise(orders = n())

Daily order counts look roughly Poisson-distributed, except that the mean (157.616) ≠ the variance (8014.05).

#####daily_orders histogram#####
daily_orders %>% ggplot(aes(x = orders)) +
  geom_histogram(aes(y = after_stat(density)))+
  geom_density(color = "blue", linewidth = 1)+
  geom_vline(aes(xintercept = mean(orders)), color = "red", linetype = "dashed", linewidth = 1) + 
  ggtitle("daily_orders density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

cat("mean:",mean(daily_orders$orders),"\n")
## mean: 157.616
cat("variance:",var(daily_orders$orders),"\n")
## variance: 8014.05
cat("max:",max(daily_orders$orders))
## max: 1147

Even after removing 11/24, the outlier, the mean (155.9967) ≠ the variance (6419.839).
A Poisson distribution is therefore a poor fit, so we consider Negative Binomial regression for the predictive model.

daily_orders1 <- daily_orders[-which(daily_orders$orders==1147),]
daily_orders1 %>% ggplot(aes(x = orders)) +
  geom_histogram(aes(y = after_stat(density)))+
  geom_density(color = "blue", linewidth = 1)+
  geom_vline(aes(xintercept = mean(orders)), color = "red", linetype = "dashed", linewidth = 1) + 
  ggtitle("daily_orders density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

cat("mean:",mean(daily_orders1$orders),"\n")
## mean: 155.9967
cat("variance:",var(daily_orders1$orders),"\n")
## variance: 6419.839
cat("max:",max(daily_orders1$orders))
## max: 487
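The overdispersion argument can be checked more formally by fitting a Poisson GLM and comparing the residual deviance to its degrees of freedom; a ratio well above 1 supports the negative binomial choice. A sketch using the variables defined above:

```r
# Sketch: quick overdispersion check for the Poisson alternative.
pois_model <- glm(orders ~ as.factor(purchase_timestamp_week),
                  data = daily_orders1, family = poisson)
deviance(pois_model) / df.residual(pois_model)  # >> 1 indicates overdispersion
```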

1. Fitting the initial negbin_model

Model 1

\(y\) ~ \(week\)

#####initial negbin_model#####
library(MASS)
daily_orders$purchase_timestamp_week <- as.factor(daily_orders$purchase_timestamp_week)
negbin_model <- glm.nb(orders~ purchase_timestamp_week , data = daily_orders)
summary(negbin_model)
## 
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week, data = daily_orders, 
##     init.theta = 3.010848428, link = log)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               5.184270   0.061951  83.684  < 2e-16 ***
## purchase_timestamp_week2 -0.001454   0.087864  -0.017  0.98680    
## purchase_timestamp_week3 -0.029452   0.087874  -0.335  0.73751    
## purchase_timestamp_week4 -0.092195   0.087647  -1.052  0.29285    
## purchase_timestamp_week5 -0.137844   0.087665  -1.572  0.11586    
## purchase_timestamp_week6 -0.386487   0.088036  -4.390 1.13e-05 ***
## purchase_timestamp_week7 -0.288577   0.087986  -3.280  0.00104 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.0108) family taken to be 1)
## 
##     Null deviance: 687.61  on 611  degrees of freedom
## Residual deviance: 653.91  on 605  degrees of freedom
## AIC: 7130.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.011 
##           Std. Err.:  0.170 
## 
##  2 x log-likelihood:  -7114.126

The baseline level is Monday. Monday (the intercept), Saturday, and Sunday are the significant terms, and the table shows significantly fewer orders on Saturday and Sunday.

par(mfrow = c(2, 2))
plot(negbin_model)

Residual plots:

  1. Residuals vs Fitted: residuals scatter randomly around 0, with only a few high-residual points (334, 335, 336).

  2. Q-Q plot: points roughly follow the 45-degree line, with some departure in the right tail, so the residuals are not fully normal and a few extreme values exist.

  3. Scale-Location plot: the square root of the standardized residuals against the fitted values looks randomly scattered, suggesting homoscedasticity.

  4. Residuals vs Leverage plot: a few high-leverage points and outliers (334, 335, 336) stand out.

Conclusions:

  1. Model fit: generally good, but a few high-residual points warrant further inspection.

  2. Residual distribution: the right tail deviates, indicating outliers or extreme values.

  3. Scale-Location: residual variance appears homogeneous; there is no heteroscedasticity problem.

  4. Outliers and leverage: a few high-leverage points need further inspection.

Overall the model is reasonable, but the high residuals and high-leverage points deserve a closer look.

Inspecting the flagged observations:

Observations 334, 335, and 336 correspond to 11/24, 11/25, and 11/26, the promotion-driven order spike, so we consider removing them and refitting the model.

daily_orders[c(334,335,336),]
## # A tibble: 3 × 5
## # Groups:   purchase_timestamp_year, purchase_timestamp_month, purchase_timestamp_day [3]
##   purchase_timestamp_y…¹ purchase_timestamp_m…² purchase_timestamp_day purchase_timestamp_w…³ orders
##                    <dbl>                  <dbl>                  <int> <fct>                   <int>
## 1                   2017                     11                     24 5                        1147
## 2                   2017                     11                     25 6                         487
## 3                   2017                     11                     26 7                         382
## # ℹ abbreviated names: ¹​purchase_timestamp_year, ²​purchase_timestamp_month,
## #   ³​purchase_timestamp_week
daily_orders1 <- daily_orders[-c(334,335,336),]
daily_orders1$purchase_timestamp_week <- as.factor(daily_orders1$purchase_timestamp_week)

negbin_model1 <- glm.nb(orders~ purchase_timestamp_week , data = daily_orders1)
summary(negbin_model1)
## 
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week, data = daily_orders1, 
##     init.theta = 3.211134206, link = log)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               5.184270   0.060021  86.374  < 2e-16 ***
## purchase_timestamp_week2 -0.001454   0.085126  -0.017 0.986372    
## purchase_timestamp_week3 -0.029452   0.085137  -0.346 0.729394    
## purchase_timestamp_week4 -0.092195   0.084919  -1.086 0.277619    
## purchase_timestamp_week5 -0.213978   0.085216  -2.511 0.012039 *  
## purchase_timestamp_week6 -0.422196   0.085575  -4.934 8.07e-07 ***
## purchase_timestamp_week7 -0.310405   0.085513  -3.630 0.000284 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.2111) family taken to be 1)
## 
##     Null deviance: 694.10  on 608  degrees of freedom
## Residual deviance: 650.28  on 602  degrees of freedom
## AIC: 7045.8
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.211 
##           Std. Err.:  0.183 
## 
##  2 x log-likelihood:  -7029.777
  1. 2 x log-likelihood increased: the model explains the data better.

  2. AIC decreased: a better fit for the same model complexity.

par(mfrow = c(2, 2))
plot(negbin_model1)

Conclusions after removing outliers 334, 335, and 336:

  1. Improved residual diagnostics: removing the outliers resolves the high-residual and high-leverage issues, and the new model is more stable.

  2. Lower AIC: the new model achieves a better fit-complexity tradeoff.

  3. Higher 2 x log-likelihood: the new model fits the data better and has more explanatory power.

Validation

Randomly split the data: 70% for training and 30% for testing.

######validation#####
library(caret)
set.seed(123)
train_index <- createDataPartition(daily_orders1$orders, p = 0.7, list = FALSE)
train_data <- daily_orders1[train_index, ]
test_data <- daily_orders1[-train_index, ]
# train_data 
model <- glm.nb(formula = orders ~ purchase_timestamp_week, data = train_data, link = log)
summary(model)
## 
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week, data = train_data, 
##     link = log, init.theta = 3.216424384)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               5.22455    0.07085  73.736  < 2e-16 ***
## purchase_timestamp_week2 -0.04940    0.09837  -0.502 0.615527    
## purchase_timestamp_week3 -0.12753    0.10151  -1.256 0.208977    
## purchase_timestamp_week4 -0.09532    0.10487  -0.909 0.363383    
## purchase_timestamp_week5 -0.28853    0.09957  -2.898 0.003758 ** 
## purchase_timestamp_week6 -0.46124    0.10171  -4.535 5.76e-06 ***
## purchase_timestamp_week7 -0.34961    0.10163  -3.440 0.000582 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.2164) family taken to be 1)
## 
##     Null deviance: 491.46  on 428  degrees of freedom
## Residual deviance: 457.96  on 422  degrees of freedom
## AIC: 4965.3
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.216 
##           Std. Err.:  0.219 
## 
##  2 x log-likelihood:  -4949.344
predictions <- predict(model, test_data, type = "response")

# 評估模型準確度
mse <- mean((test_data$orders - predictions)^2)
rmse <- sqrt(mse)
aic <- AIC(model)
print(paste("MSE: ", mse))
## [1] "MSE:  5926.13496475934"
print(paste("RMSE: ", rmse))
## [1] "RMSE:  76.9813936270274"
print(paste("AIC: ", aic))
## [1] "AIC:  4965.34443471231"
##
train_data %>% ggplot(aes(x = orders)) +
  geom_histogram(aes(y = after_stat(density)))+
  geom_density(color = "blue", linewidth = 1)+
  geom_vline(aes(xintercept = mean(orders)), color = "red", linetype = "dashed", linewidth = 1) + 
  ggtitle("daily_orders density")

print(paste("mean of train_data:",mean(train_data$orders)))
## [1] "mean of train_data: 154.839160839161"
## Actual vs. predicted orders
library(ggplot2)
test_data$predicted_orders <- predict(model, newdata = test_data, type = "response")
ggplot(test_data, aes(x = orders, y = predicted_orders)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Actual vs. Predicted Orders",
       x = "Actual Orders",
       y = "Predicted Orders") +
  theme_minimal()

Cross-validation: k = 10

# Train with 10-fold cross-validation (caret)
k <- 10
set.seed(123)
train_control <- trainControl(method = "cv", number = k)
model <- train(orders ~ purchase_timestamp_week,
               data = daily_orders1,
               method = "glm.nb",
               trControl = train_control)

print(model)
## Negative Binomial Generalized Linear Model 
## 
## 609 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 547, 548, 548, 549, 548, 548, ... 
## Resampling results across tuning parameters:
## 
##   link      RMSE      Rsquared    MAE     
##   identity  75.66016  0.08889727  63.28178
##   log       75.66016  0.08889727  63.28178
##   sqrt      75.66016  0.08889727  63.28178
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was link = log.
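caret's resampling above can be cross-checked with a hand-rolled 10-fold loop in base R. This is only a sketch, assuming `MASS` is loaded and `daily_orders1` is the data frame built earlier; the random fold assignment here will not reproduce caret's exact folds.

```r
# Hand-rolled 10-fold CV for the week-only negative binomial model (sketch)
set.seed(123)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(daily_orders1)))
rmse_by_fold <- sapply(1:k, function(i) {
  fit  <- MASS::glm.nb(orders ~ purchase_timestamp_week,
                       data = daily_orders1[folds != i, ], link = log)
  pred <- predict(fit, newdata = daily_orders1[folds == i, ], type = "response")
  sqrt(mean((daily_orders1$orders[folds == i] - pred)^2))
})
mean(rmse_by_fold)  # should land in the same neighbourhood as caret's RMSE
```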

Summary

The fit is poor: predictions deviate substantially from the actual values.

  1. Orders range from 0 to 487, yet the MAE is about 63 and the RMSE about 75, so on average the predictions are far from the actual values.

  2. The model explains only about 8.9% of the variance in the response, meaning it captures little of the pattern in the data.

Metric definitions:

Mean Absolute Error (MAE)

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| = 63.28178 \]

Root Mean Squared Error (RMSE)

\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = 75.66016 \]

Coefficient of Determination (R²)

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 0.08889727 = 8.88\% \]
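The three metrics can be computed directly from paired actual/predicted vectors; a minimal sketch on made-up numbers (purely illustrative, not model output):

```r
actual    <- c(120, 150, 90, 200, 170)
predicted <- c(130, 140, 100, 180, 160)

mae  <- mean(abs(actual - predicted))                                    # 12
rmse <- sqrt(mean((actual - predicted)^2))                               # ~12.65
r2   <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2) # ~0.89
```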

2. Building Model 2

Add the change in order volume relative to the previous day.

##### Create the lag-1 variable and the daily difference variable #####
daily_orders1$lag1_orders <- dplyr::lag(daily_orders1$orders, 1)
daily_orders1$diff_orders <- daily_orders1$orders - dplyr::lag(daily_orders1$orders, 1)

# Drop rows with NA (the first row has no lagged value)
daily_orders2 <- na.omit(daily_orders1)
## Plot the daily change in orders
daily_orders2$N <- seq_len(nrow(daily_orders2))
daily_orders2 %>%
  ggplot(aes(x = N, y = diff_orders)) +
  geom_line() +
  ggtitle("Daily change in orders (diff_orders)")
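`dplyr::lag()` shifts a vector down by one position and pads the front with `NA`, which is why the first row has to be dropped with `na.omit()` afterwards; a tiny sketch:

```r
library(dplyr)
x <- c(5, 8, 6, 9)
lag(x, 1)      # NA  5  8  6
x - lag(x, 1)  # NA  3 -2  3
```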

Model 2

\(y \sim week + diff\_orders\)

## negbin_model2
negbin_model2 <- glm.nb(orders ~ purchase_timestamp_week + diff_orders, data = daily_orders2)
summary(negbin_model2)
## 
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week + diff_orders, 
##     data = daily_orders2, init.theta = 3.355518393, link = log)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               5.0355398  0.0701814  71.750  < 2e-16 ***
## purchase_timestamp_week2  0.1569526  0.0923791   1.699  0.08932 .  
## purchase_timestamp_week3  0.1274052  0.0934646   1.363  0.17284    
## purchase_timestamp_week4  0.0943661  0.0948430   0.995  0.31975    
## purchase_timestamp_week5 -0.0155281  0.0986774  -0.157  0.87496    
## purchase_timestamp_week6 -0.1906683  0.1039308  -1.835  0.06657 .  
## purchase_timestamp_week7 -0.2051582  0.0880204  -2.331  0.01976 *  
## diff_orders               0.0024336  0.0008064   3.018  0.00254 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.3555) family taken to be 1)
## 
##     Null deviance: 702.59  on 607  degrees of freedom
## Residual deviance: 646.60  on 600  degrees of freedom
## AIC: 7014.6
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.356 
##           Std. Err.:  0.192 
## 
##  2 x log-likelihood:  -6996.640

The table shows that the daily change in orders (diff_orders) is a statistically significant predictor of order volume.

par(mfrow = c(2, 2))
plot(negbin_model2)

Residual diagnostics:

  1. Residuals vs Fitted: residuals lie around 0 but split into two clusters, suggesting some remaining trend.

  2. QQ plot: the right tail deviates, so the residuals are not fully normal and some outliers exist in the right tail.

  3. Scale-Location plot: the points again split into two clusters, suggesting some trend.

  4. Residuals vs Leverage plot: no isolated high-leverage points or outliers.

Conclusions:

  1. Model fit: some trend appears to remain and needs further checking.

  2. Residual distribution: the right tail deviates, indicating outliers or extreme values.

  3. Scale-Location: the residual variance is not constant, i.e. there is heteroscedasticity.

  4. Outliers and high-leverage points: none present.

Overall the model's explanatory power increased, but some structure remains and further adjustment is needed.
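A further check worth running is whether the negative binomial is actually needed over a plain Poisson; the sketch below (same formula and data as above) compares the two by AIC and looks at the Poisson dispersion ratio:

```r
# Overdispersion check: Poisson vs. negative binomial (sketch)
pois_fit <- glm(orders ~ purchase_timestamp_week + diff_orders,
                family = poisson, data = daily_orders2)
nb_fit   <- MASS::glm.nb(orders ~ purchase_timestamp_week + diff_orders,
                         data = daily_orders2)
AIC(pois_fit, nb_fit)                       # a much lower NB AIC favours glm.nb
deviance(pois_fit) / df.residual(pois_fit)  # values far above 1 signal overdispersion
```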

Validation

### Validation split
set.seed(123)
train_index <- createDataPartition(daily_orders2$orders, p = 0.7, list = FALSE)
train_data <- daily_orders2[train_index, ]
test_data <- daily_orders2[-train_index, ]
# Fit on the training data
model <- glm.nb(formula = orders ~ purchase_timestamp_week + diff_orders, data = train_data, link = log)
summary(model)
## 
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week + diff_orders, 
##     data = train_data, link = log, init.theta = 3.357073728)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               5.0389982  0.0810701  62.156  < 2e-16 ***
## purchase_timestamp_week2  0.1837154  0.1069400   1.718  0.08581 .  
## purchase_timestamp_week3  0.0695819  0.1104930   0.630  0.52886    
## purchase_timestamp_week4  0.0596925  0.1116288   0.535  0.59283    
## purchase_timestamp_week5 -0.0125628  0.1148012  -0.109  0.91286    
## purchase_timestamp_week6 -0.1698997  0.1210926  -1.403  0.16060    
## purchase_timestamp_week7 -0.2364969  0.1030128  -2.296  0.02169 *  
## diff_orders               0.0028126  0.0009253   3.040  0.00237 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.3571) family taken to be 1)
## 
##     Null deviance: 499.43  on 427  degrees of freedom
## Residual deviance: 455.99  on 420  degrees of freedom
## AIC: 4941.2
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.357 
##           Std. Err.:  0.229 
## 
##  2 x log-likelihood:  -4923.218
predictions <- predict(model, test_data, type = "response")

mse <- mean((test_data$orders - predictions)^2)
rmse <- sqrt(mse)
aic <- AIC(model)
print(paste("MSE: ", mse))
## [1] "MSE:  5748.59644159849"
print(paste("RMSE: ", rmse))
## [1] "RMSE:  75.8194990856474"
print(paste("AIC: ", aic))
## [1] "AIC:  4941.21753255967"
test_data$predicted_orders <- predict(model, newdata = test_data, type = "response")
ggplot(test_data, aes(x = orders, y = predicted_orders)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Actual vs. Predicted Orders",
       x = "Actual Orders",
       y = "Predicted Orders") +
  theme_minimal()

Cross-validation: k = 10

# Train with 10-fold cross-validation (caret)
k <- 10
set.seed(123)
train_control <- trainControl(method = "cv", number = k)
model <- train(orders ~ purchase_timestamp_week + diff_orders,
               data = daily_orders2,
               method = "glm.nb",
               trControl = train_control)

print(model)
## Negative Binomial Generalized Linear Model 
## 
## 608 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 546, 547, 548, 548, 547, 547, ... 
## Resampling results across tuning parameters:
## 
##   link      RMSE      Rsquared   MAE     
##   identity  75.00710  0.1033302  62.86533
##   log       74.50003  0.1206841  62.89483
##   sqrt      74.70994  0.1144106  63.05278
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was link = log.

Summary

  1. The fit still falls short, but the gap between actual and predicted values narrowed, so adding this variable appears to be a step in the right direction.

  2. After cross-validation, the average error is essentially unchanged and the explanatory power shows no significant improvement.

Metric definitions:

Mean Absolute Error (MAE)

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| = 62.86533 \]

Root Mean Squared Error (RMSE)

\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = 75.00710 \]

Coefficient of Determination (R²)

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 0.1033302 = 10.33\% \]

3. Building Model 3

Add order counts broken out by time of day (early morning, morning, afternoon, night).

## Add time-of-day (period) buckets
hour_orders <- hour_orders %>%
  mutate(period = case_when(
    purchase_timestamp_hour >= 0 & purchase_timestamp_hour <= 6 ~ "early_morning",
    purchase_timestamp_hour > 6 & purchase_timestamp_hour <= 12 ~ "morning",
    purchase_timestamp_hour > 12 & purchase_timestamp_hour <= 18 ~ "afternoon",
    purchase_timestamp_hour > 18 & purchase_timestamp_hour <=24 ~ "night"
  ))

period_orders <- hour_orders %>%
  group_by(purchase_timestamp_year,purchase_timestamp_month,
           purchase_timestamp_day,purchase_timestamp_week,period) %>%
  summarise(period_orders = sum(orders))

# The values_fill argument fills missing period counts with 0.
period_orders <- period_orders %>%
  pivot_wider(names_from = period, 
              values_from = period_orders, 
              values_fill = list(period_orders = 0)) %>%
  mutate(orders = morning + afternoon + night + early_morning)
##### Add time-of-day (morning/afternoon/night/early-morning) variables #####
# Lag-1 and absolute daily difference for each series
period_orders$lag1_orders <- dplyr::lag(period_orders$orders, 1)
period_orders$diff_orders <- abs(period_orders$orders - dplyr::lag(period_orders$orders, 1))

period_orders$lag1_morning <- dplyr::lag(period_orders$morning, 1)
period_orders$diff_morning <- abs(period_orders$morning - dplyr::lag(period_orders$morning, 1))

period_orders$lag1_afternoon <- dplyr::lag(period_orders$afternoon, 1)
period_orders$diff_afternoon <- abs(period_orders$afternoon - dplyr::lag(period_orders$afternoon, 1))

period_orders$lag1_night <- dplyr::lag(period_orders$night, 1)
period_orders$diff_night <- abs(period_orders$night - dplyr::lag(period_orders$night, 1))

period_orders$lag1_early_morning <- dplyr::lag(period_orders$early_morning, 1)
period_orders$diff_early_morning <- abs(period_orders$early_morning - dplyr::lag(period_orders$early_morning, 1))

# Drop rows with NA (the first row has no lagged value)
period_orders1 <- na.omit(period_orders)
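The `values_fill` step above matters because a day with zero orders in some period produces no row for that period; a minimal `pivot_wider()` sketch on toy data:

```r
library(tidyr)
toy <- data.frame(day    = c(1, 1, 2),
                  period = c("morning", "night", "morning"),
                  n      = c(3, 5, 4))
pivot_wider(toy, names_from = period, values_from = n,
            values_fill = list(n = 0))
# day 2 has no "night" row, so its night column becomes 0 instead of NA
```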

Model 3

\(y \sim week + diff\_orders + diff\_morning + diff\_afternoon + diff\_night + diff\_early\_morning\)

period_orders1 <- period_orders1[-c(333,334,335),]
period_orders1$purchase_timestamp_week <- as.factor(period_orders1$purchase_timestamp_week)
period_orders1$purchase_timestamp_month <- as.factor(period_orders1$purchase_timestamp_month)

negbin_model3 <- glm.nb(orders~ purchase_timestamp_week+diff_orders+
                          diff_morning+diff_afternoon+
                          diff_night+diff_early_morning, 
                        data = period_orders1)

summary(negbin_model3)
## 
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week + diff_orders + 
##     diff_morning + diff_afternoon + diff_night + diff_early_morning, 
##     data = period_orders1, init.theta = 4.447943597, link = log)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.360563   0.071757  60.768  < 2e-16 ***
## purchase_timestamp_week2  0.305134   0.077890   3.918 8.95e-05 ***
## purchase_timestamp_week3  0.289908   0.077665   3.733 0.000189 ***
## purchase_timestamp_week4  0.225370   0.077785   2.897 0.003764 ** 
## purchase_timestamp_week5  0.079473   0.077542   1.025 0.305413    
## purchase_timestamp_week6 -0.207187   0.075136  -2.757 0.005825 ** 
## purchase_timestamp_week7  0.004698   0.080386   0.058 0.953398    
## diff_orders              -0.002219   0.001484  -1.496 0.134771    
## diff_morning              0.014275   0.002457   5.809 6.30e-09 ***
## diff_afternoon            0.013420   0.002428   5.527 3.26e-08 ***
## diff_night                0.015152   0.002220   6.825 8.79e-12 ***
## diff_early_morning        0.026243   0.006287   4.174 2.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(4.4479) family taken to be 1)
## 
##     Null deviance: 918.40  on 607  degrees of freedom
## Residual deviance: 645.85  on 596  degrees of freedom
## AIC: 6849.9
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  4.448 
##           Std. Err.:  0.262 
## 
##  2 x log-likelihood:  -6823.931

The table shows that the daily changes within each time-of-day period are significant predictors of order volume; once they are included, the overall daily change (diff_orders) is no longer significant.

par(mfrow = c(2, 2))
plot(negbin_model3)

Residual diagnostics:

  1. Residuals vs Fitted: residuals cluster around 0 with no obvious trend.

  2. QQ plot: the right tail deviates, so the residuals are not fully normal and some outliers exist in the right tail.

  3. Scale-Location plot: the points cluster together with no obvious trend.

  4. Residuals vs Leverage plot: no isolated high-leverage points or outliers.

Conclusions:

  1. Model fit: looks good.

  2. Residual distribution: the right tail deviates; some outliers or extreme values exist.

  3. Scale-Location: the residual variance is roughly constant, so there is no heteroscedasticity problem.

  4. Outliers and high-leverage points: none present.

Overall the model's explanatory power increased.

Validation

set.seed(123)
train_index <- createDataPartition(period_orders1$orders, p = 0.7, list = FALSE)
train_data <- period_orders1[train_index, ]
test_data <- period_orders1[-train_index, ]
# Fit on the training data
model <- glm.nb(formula = orders ~ purchase_timestamp_week+diff_orders+
                  diff_morning+diff_afternoon+
                  diff_night+diff_early_morning,
                data = train_data, link = log)
summary(model)
## 
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week + diff_orders + 
##     diff_morning + diff_afternoon + diff_night + diff_early_morning, 
##     data = train_data, link = log, init.theta = 4.504909298)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.400898   0.082438  53.385  < 2e-16 ***
## purchase_timestamp_week2  0.307413   0.090264   3.406  0.00066 ***
## purchase_timestamp_week3  0.190859   0.090466   2.110  0.03488 *  
## purchase_timestamp_week4  0.161698   0.091853   1.760  0.07834 .  
## purchase_timestamp_week5  0.064458   0.090547   0.712  0.47655    
## purchase_timestamp_week6 -0.222187   0.089679  -2.478  0.01323 *  
## purchase_timestamp_week7 -0.037956   0.094175  -0.403  0.68692    
## diff_orders              -0.002486   0.001689  -1.472  0.14106    
## diff_morning              0.014189   0.002856   4.968 6.75e-07 ***
## diff_afternoon            0.014314   0.002943   4.864 1.15e-06 ***
## diff_night                0.015563   0.002477   6.282 3.34e-10 ***
## diff_early_morning        0.018249   0.007718   2.364  0.01807 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(4.5049) family taken to be 1)
## 
##     Null deviance: 660.07  on 427  degrees of freedom
## Residual deviance: 456.00  on 416  degrees of freedom
## AIC: 4823
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  4.505 
##           Std. Err.:  0.317 
## 
##  2 x log-likelihood:  -4797.041
predictions <- predict(model, test_data, type = "response")

mse <- mean((test_data$orders - predictions)^2)
rmse <- sqrt(mse)
aic <- AIC(model)

print(paste("MSE: ", mse))
## [1] "MSE:  4611.49800856018"
print(paste("RMSE: ", rmse))
## [1] "RMSE:  67.9080113724455"
print(paste("AIC: ", aic))
## [1] "AIC:  4823.04080163789"
test_data$predicted_orders <- predict(model, newdata = test_data, type = "response")
ggplot(test_data, aes(x = orders, y = predicted_orders)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Actual vs. Predicted Orders",
       x = "Actual Orders",
       y = "Predicted Orders") +
  theme_minimal()

Cross-validation: k = 10

# Train with 10-fold cross-validation (caret)
k <- 10
set.seed(123)
train_control <- trainControl(method = "cv", number = k)
model <- train(orders ~ purchase_timestamp_week+diff_orders+
                  diff_morning+diff_afternoon+
                  diff_night+diff_early_morning,
               data = period_orders1,
               method = "glm.nb",
               trControl = train_control)

print(model)
## Negative Binomial Generalized Linear Model 
## 
## 608 samples
##   6 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 546, 547, 548, 548, 547, 547, ... 
## Resampling results across tuning parameters:
## 
##   link      RMSE      Rsquared   MAE     
##   identity  71.42328  0.2587004  55.34819
##   log       67.78825  0.3238347  51.87033
##   sqrt      65.07121  0.3564331  50.51995
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was link = sqrt.

Summary

  1. The fit improved somewhat; the actual-vs-predicted scatter now looks off mainly by a slope.

  2. After cross-validation the average error dropped, so the gap between predictions and actual values improved, and explanatory power rose to roughly 26% (identity link; about 36% with the selected sqrt link).

Metric definitions:

Mean Absolute Error (MAE)

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| = 55.34819 \]

Root Mean Squared Error (RMSE)

\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = 71.42328 \]

Coefficient of Determination (R²)

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 0.2587004 = 25.87\% \]

Modeling with added region (state) information

# Group by purchase date and weekday, and count orders per hour and customer state
hour_orders_customer <- orders_customer3%>%
  group_by(purchase_timestamp_year,purchase_timestamp_month,
           purchase_timestamp_day,purchase_timestamp_week,
           purchase_timestamp_hour,customer_state) %>%
  summarise(orders = n())

hour_orders_customer <- hour_orders_customer %>%
  mutate(period = case_when(
    purchase_timestamp_hour >= 0 & purchase_timestamp_hour <= 6 ~ "early_morning",
    purchase_timestamp_hour > 6 & purchase_timestamp_hour <= 12 ~ "morning",
    purchase_timestamp_hour > 12 & purchase_timestamp_hour <= 18 ~ "afternoon",
    purchase_timestamp_hour > 18 & purchase_timestamp_hour <=24 ~ "night"
  ))

hour_orders_customer <- hour_orders_customer %>%
  mutate(state = case_when(
    customer_state =="SP" ~ "SP",
    customer_state =="RJ" ~ "RJ",
    customer_state =="MG" ~ "MG",
    TRUE ~ "others"
  ))
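The `TRUE ~ "others"` branch in `case_when()` acts as the catch-all for every state other than SP, RJ, and MG; a quick sketch:

```r
library(dplyr)
st <- c("SP", "RJ", "BA", "MG", "PR")
case_when(st == "SP" ~ "SP",
          st == "RJ" ~ "RJ",
          st == "MG" ~ "MG",
          TRUE ~ "others")
# "SP" "RJ" "others" "MG" "others"
```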
## Aggregate by period (time of day)
period_state_suborders1 <- hour_orders_customer %>%
  group_by(purchase_timestamp_year,purchase_timestamp_month,
           purchase_timestamp_day,purchase_timestamp_week,period) %>%
  summarise(period_orders = sum(orders)) 

period_state_suborders1 <- period_state_suborders1 %>%
  pivot_wider(names_from = period, 
              values_from = period_orders, 
              values_fill = list(period_orders = 0)) %>%
  mutate(orders = morning + afternoon + night + early_morning)

period_state_suborders1$key <- c(1:nrow(period_state_suborders1))

## Aggregate by state
period_state_suborders2 <- hour_orders_customer %>%
  group_by(purchase_timestamp_year,purchase_timestamp_month,
           purchase_timestamp_day,purchase_timestamp_week,state) %>%
  summarise(state_orders = sum(orders)) 


period_state_suborders2 <- period_state_suborders2 %>%
  pivot_wider(names_from = state, 
              values_from = state_orders, 
              values_fill = list(state_orders = 0)) %>%
  mutate(orders = SP + RJ + MG + others)

period_state_suborders2$key <- c(1:nrow(period_state_suborders2))
sum(period_state_suborders2$orders-daily_orders$orders)
## [1] 0
# Join period_state_suborders1 and period_state_suborders2 on key
period_state_orders <- left_join( period_state_suborders1[,-c(9)],period_state_suborders2[,5:10], by = "key") 

period_state_orders <- period_state_orders[-c(334,335,336),]
##### Add region variables: period_state_orders #####
period_state_orders$lag1_orders <- dplyr::lag(period_state_orders$orders, 1)
period_state_orders$lag2_orders <- dplyr::lag(period_state_orders$orders, 2)
period_state_orders$diff_orders <- abs(period_state_orders$lag2_orders - period_state_orders$lag1_orders)

period_state_orders$lag1_morning <- dplyr::lag(period_state_orders$morning, 1)
period_state_orders$lag2_morning <- dplyr::lag(period_state_orders$morning, 2)
period_state_orders$diff_morning <- abs(period_state_orders$lag2_morning - period_state_orders$lag1_morning)

period_state_orders$lag1_afternoon <- dplyr::lag(period_state_orders$afternoon, 1)
period_state_orders$lag2_afternoon <- dplyr::lag(period_state_orders$afternoon, 2)
period_state_orders$diff_afternoon <- abs(period_state_orders$lag2_afternoon - period_state_orders$lag1_afternoon)

period_state_orders$lag1_night <- dplyr::lag(period_state_orders$night, 1)
period_state_orders$lag2_night <- dplyr::lag(period_state_orders$night, 2)
period_state_orders$diff_night <- abs(period_state_orders$lag2_night - period_state_orders$lag1_night)

period_state_orders$lag1_early_morning <- dplyr::lag(period_state_orders$early_morning, 1)
period_state_orders$lag2_early_morning <- dplyr::lag(period_state_orders$early_morning, 2)
period_state_orders$diff_early_morning <- abs(period_state_orders$lag2_early_morning - period_state_orders$lag1_early_morning)

period_state_orders$lag1_SP <- dplyr::lag(period_state_orders$SP, 1)
period_state_orders$lag2_SP <- dplyr::lag(period_state_orders$SP, 2)
period_state_orders$diff_SP <- abs(period_state_orders$lag2_SP - period_state_orders$lag1_SP)

period_state_orders$lag1_MG <- dplyr::lag(period_state_orders$MG, 1)
period_state_orders$lag2_MG <- dplyr::lag(period_state_orders$MG, 2)
period_state_orders$diff_MG <- abs(period_state_orders$lag2_MG - period_state_orders$lag1_MG)

period_state_orders$lag1_RJ <- dplyr::lag(period_state_orders$RJ, 1)
period_state_orders$lag2_RJ <- dplyr::lag(period_state_orders$RJ, 2)
period_state_orders$diff_RJ <- abs(period_state_orders$lag2_RJ - period_state_orders$lag1_RJ)

period_state_orders$lag1_others <- dplyr::lag(period_state_orders$others, 1)
period_state_orders$lag2_others <- dplyr::lag(period_state_orders$others, 2)
period_state_orders$diff_others <- abs(period_state_orders$lag2_others - period_state_orders$lag1_others)

period_state_orders1 <- na.omit(period_state_orders)
period_state_orders1$purchase_timestamp_week <- as.factor(period_state_orders1$purchase_timestamp_week)

period_state_orders2<- period_state_orders1[-c(368),]

4. Stepwise Regression Selection

With many more candidate variables, stepwise regression is used to screen for a suitable subset.
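`step()` adds (or drops) one term per iteration, keeping whichever move lowers AIC the most and stopping when no move helps; a minimal sketch on the built-in `mtcars` data (an illustration only, not the project data):

```r
base <- lm(mpg ~ 1, data = mtcars)
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)
sel  <- step(base, scope = list(lower = base, upper = full),
             direction = "forward", trace = FALSE)
formula(sel)  # the AIC-selected subset of wt, hp and qsec
```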

# Initial (intercept-only) model
initial_model <- glm.nb(orders ~ 1, data = period_state_orders2)
# Full model: all candidate predictors
# NOTE: the stray comma after diff_orders' group means the expression
# lag1_SP + lag1_MG + lag1_RJ + lag1_others is passed as glm.nb's `weights`
# argument rather than added to the formula (visible in the `weights = ...`
# line of the backward model's call below); replace that comma with `+` to
# include those terms as predictors.
full_model <- glm.nb(orders ~ purchase_timestamp_week + diff_orders +
                       diff_morning + diff_afternoon + diff_night + diff_early_morning +
                       lag1_morning + lag1_afternoon + lag1_night + lag1_early_morning +
                       diff_SP + diff_MG + diff_RJ + diff_others,
                     lag1_SP + lag1_MG + lag1_RJ + lag1_others,
                     data = period_state_orders2)
# Forward stepwise selection
forward_model <- step(initial_model, scope = list(lower = initial_model, upper = full_model), 
                      direction = "forward")
## Start:  AIC=7023.9
## orders ~ 1
## 
##                           Df Deviance    AIC
## + lag1_night               1   228.03 6609.1
## + lag1_afternoon           1   251.40 6632.5
## + lag1_morning             1   336.62 6717.7
## + lag1_early_morning       1   453.22 6834.3
## + diff_others              1   587.05 6968.1
## + diff_night               1   589.88 6970.9
## + diff_orders              1   590.14 6971.2
## + diff_afternoon           1   591.88 6972.9
## + diff_morning             1   592.03 6973.1
## + diff_SP                  1   597.74 6978.8
## + diff_MG                  1   598.47 6979.5
## + diff_RJ                  1   602.59 6983.6
## + purchase_timestamp_week  6   601.16 6992.2
## + diff_early_morning       1   622.43 7003.5
## <none>                         644.84 7023.9
## 
## Step:  AIC=6335.26
## orders ~ lag1_night
## 
##                           Df Deviance    AIC
## + lag1_afternoon           1   578.94 6246.9
## + lag1_morning             1   627.25 6295.2
## + lag1_early_morning       1   642.73 6310.7
## + diff_night               1   652.36 6320.4
## + diff_morning             1   658.43 6326.4
## + diff_RJ                  1   659.36 6327.4
## + diff_SP                  1   660.71 6328.7
## + purchase_timestamp_week  6   651.05 6329.0
## + diff_orders              1   662.42 6330.4
## + diff_MG                  1   663.10 6331.1
## + diff_others              1   664.14 6332.1
## <none>                         669.26 6335.3
## + diff_early_morning       1   667.44 6335.4
## + diff_afternoon           1   668.61 6336.6
## 
## Step:  AIC=6238.85
## orders ~ lag1_night + lag1_afternoon
## 
##                           Df Deviance    AIC
## + purchase_timestamp_week  6   601.17 6171.5
## + diff_night               1   672.40 6232.8
## <none>                         680.48 6238.8
## + lag1_early_morning       1   678.48 6238.9
## + diff_morning             1   678.92 6239.3
## + diff_RJ                  1   678.95 6239.3
## + diff_MG                  1   679.45 6239.8
## + diff_afternoon           1   679.78 6240.1
## + diff_others              1   679.86 6240.2
## + diff_SP                  1   680.03 6240.4
## + lag1_morning             1   680.13 6240.5
## + diff_orders              1   680.32 6240.7
## + diff_early_morning       1   680.37 6240.7
## 
## Step:  AIC=6164.73
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week
## 
##                      Df Deviance    AIC
## + lag1_morning        1   684.14 6154.1
## + diff_night          1   691.74 6161.7
## + lag1_early_morning  1   691.79 6161.8
## <none>                    696.72 6164.7
## + diff_morning        1   695.47 6165.5
## + diff_others         1   696.01 6166.0
## + diff_MG             1   696.23 6166.2
## + diff_RJ             1   696.29 6166.3
## + diff_afternoon      1   696.37 6166.4
## + diff_SP             1   696.44 6166.4
## + diff_orders         1   696.46 6166.5
## + diff_early_morning  1   696.51 6166.5
## 
## Step:  AIC=6153.98
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week + 
##     lag1_morning
## 
##                      Df Deviance    AIC
## + diff_night          1   694.47 6150.6
## + lag1_early_morning  1   696.85 6153.0
## <none>                    699.87 6154.0
## + diff_afternoon      1   699.02 6155.1
## + diff_morning        1   699.25 6155.4
## + diff_others         1   699.33 6155.4
## + diff_RJ             1   699.66 6155.8
## + diff_MG             1   699.68 6155.8
## + diff_SP             1   699.75 6155.9
## + diff_orders         1   699.81 6155.9
## + diff_early_morning  1   699.81 6155.9
## 
## Step:  AIC=6150.55
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week + 
##     lag1_morning + diff_night
## 
##                      Df Deviance    AIC
## + lag1_early_morning  1   696.88 6149.1
## <none>                    700.29 6150.5
## + diff_afternoon      1   699.48 6151.7
## + diff_morning        1   699.84 6152.1
## + diff_orders         1   700.07 6152.3
## + diff_others         1   700.09 6152.3
## + diff_RJ             1   700.25 6152.5
## + diff_MG             1   700.25 6152.5
## + diff_SP             1   700.25 6152.5
## + diff_early_morning  1   700.28 6152.5
## 
## Step:  AIC=6149.13
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week + 
##     lag1_morning + diff_night + lag1_early_morning
## 
##                      Df Deviance    AIC
## <none>                    700.83 6149.1
## + diff_afternoon      1   699.98 6150.3
## + diff_morning        1   700.14 6150.4
## + diff_others         1   700.38 6150.7
## + diff_MG             1   700.72 6151.0
## + diff_orders         1   700.73 6151.0
## + diff_RJ             1   700.78 6151.1
## + diff_SP             1   700.80 6151.1
## + diff_early_morning  1   700.81 6151.1
summary(forward_model )
## 
## Call:
## glm.nb(formula = orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week + 
##     lag1_morning + diff_night + lag1_early_morning, data = period_state_orders2, 
##     init.theta = 16.2117696, link = log)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.0908933  0.0392028 104.352  < 2e-16 ***
## lag1_night                0.0093107  0.0009775   9.525  < 2e-16 ***
## lag1_afternoon            0.0071557  0.0009763   7.329 2.32e-13 ***
## purchase_timestamp_week2 -0.2847990  0.0429577  -6.630 3.36e-11 ***
## purchase_timestamp_week3 -0.2967043  0.0431882  -6.870 6.42e-12 ***
## purchase_timestamp_week4 -0.2817291  0.0445871  -6.319 2.64e-10 ***
## purchase_timestamp_week5 -0.3382511  0.0451792  -7.487 7.05e-14 ***
## purchase_timestamp_week6 -0.4198728  0.0469256  -8.948  < 2e-16 ***
## purchase_timestamp_week7 -0.1329664  0.0426327  -3.119 0.001815 ** 
## lag1_morning              0.0037271  0.0011176   3.335 0.000853 ***
## diff_night                0.0028294  0.0011702   2.418 0.015616 *  
## lag1_early_morning        0.0059207  0.0031563   1.876 0.060678 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(16.2118) family taken to be 1)
## 
##     Null deviance: 2911.61  on 605  degrees of freedom
## Residual deviance:  700.83  on 594  degrees of freedom
## AIC: 6151.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  16.21 
##           Std. Err.:  1.17 
## 
##  2 x log-likelihood:  -6125.127
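With the log link, the selected model's coefficients are multiplicative effects on expected orders; exponentiating them gives rate ratios. A sketch using the fitted `forward_model` from above:

```r
# Rate ratios with approximate 95% Wald intervals
round(exp(cbind(RateRatio = coef(forward_model),
                confint.default(forward_model))), 3)
# e.g. exp(-0.2848) ≈ 0.752 for purchase_timestamp_week2:
# about 24.8% fewer expected orders than on week day 1, other terms held fixed
```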
# Backward stepwise selection
backward_model <- step(full_model, direction = "backward")
## Start:  AIC=925719.3
## orders ~ purchase_timestamp_week + diff_orders + diff_morning + 
##     diff_afternoon + diff_night + diff_early_morning + lag1_morning + 
##     lag1_afternoon + lag1_night + lag1_early_morning + diff_SP + 
##     diff_MG + diff_RJ + diff_others
## 
##                           Df Deviance    AIC
## <none>                         102651 925719
## - diff_afternoon           1   102656 925722
## - diff_orders              1   102667 925733
## - diff_RJ                  1   102670 925736
## - diff_MG                  1   102682 925749
## - diff_early_morning       1   102760 925827
## - diff_others              1   102766 925832
## - diff_SP                  1   102804 925870
## - diff_morning             1   103000 926066
## - diff_night               1   103215 926282
## - lag1_early_morning       1   103268 926334
## - lag1_morning             1   107686 930752
## - lag1_afternoon           1   114527 937593
## - lag1_night               1   123273 946339
## - purchase_timestamp_week  6   135548 958605
summary(backward_model)
## 
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week + diff_orders + 
##     diff_morning + diff_afternoon + diff_night + diff_early_morning + 
##     lag1_morning + lag1_afternoon + lag1_night + lag1_early_morning + 
##     diff_SP + diff_MG + diff_RJ + diff_others, data = period_state_orders2, 
##     weights = lag1_SP + lag1_MG + lag1_RJ + lag1_others, init.theta = 38.47013631, 
##     link = log)
## 
## Coefficients:
##                            Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)               4.415e+00  2.578e-03 1712.618  < 2e-16 ***
## purchase_timestamp_week2 -3.264e-01  2.683e-03 -121.630  < 2e-16 ***
## purchase_timestamp_week3 -3.080e-01  2.449e-03 -125.794  < 2e-16 ***
## purchase_timestamp_week4 -3.194e-01  2.544e-03 -125.557  < 2e-16 ***
## purchase_timestamp_week5 -3.708e-01  2.617e-03 -141.655  < 2e-16 ***
## purchase_timestamp_week6 -4.554e-01  2.791e-03 -163.138  < 2e-16 ***
## purchase_timestamp_week7 -1.804e-01  2.722e-03  -66.276  < 2e-16 ***
## diff_orders              -2.435e-04  6.114e-05   -3.982 6.85e-05 ***
## diff_morning              1.282e-03  6.864e-05   18.674  < 2e-16 ***
## diff_afternoon           -1.497e-04  6.795e-05   -2.202   0.0276 *  
## diff_night                1.511e-03  6.355e-05   23.780  < 2e-16 ***
## diff_early_morning       -1.778e-03  1.701e-04  -10.451  < 2e-16 ***
## lag1_morning              3.955e-03  5.620e-05   70.373  < 2e-16 ***
## lag1_afternoon            5.323e-03  4.900e-05  108.640  < 2e-16 ***
## lag1_night                7.030e-03  4.920e-05  142.890  < 2e-16 ***
## lag1_early_morning        3.961e-03  1.596e-04   24.814  < 2e-16 ***
## diff_SP                  -9.590e-04  7.808e-05  -12.283  < 2e-16 ***
## diff_MG                  -7.136e-04  1.277e-04   -5.589 2.29e-08 ***
## diff_RJ                  -5.882e-04  1.369e-04   -4.296 1.74e-05 ***
## diff_others               9.507e-04  8.903e-05   10.678  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(38.4701) family taken to be 1)
## 
##     Null deviance: 544468  on 605  degrees of freedom
## Residual deviance: 102651  on 586  degrees of freedom
## AIC: 925721
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  38.470 
##           Std. Err.:  0.230 
## 
##  2 x log-likelihood:  -925679.316
# Bidirectional stepwise selection
stepwise_model <- step(initial_model, scope = list(lower = initial_model, upper = full_model), 
                       direction = "both")
## Start:  AIC=7023.9
## orders ~ 1
## 
##                           Df Deviance    AIC
## + lag1_night               1   228.03 6609.1
## + lag1_afternoon           1   251.40 6632.5
## + lag1_morning             1   336.62 6717.7
## + lag1_early_morning       1   453.22 6834.3
## + diff_others              1   587.05 6968.1
## + diff_night               1   589.88 6970.9
## + diff_orders              1   590.14 6971.2
## + diff_afternoon           1   591.88 6972.9
## + diff_morning             1   592.03 6973.1
## + diff_SP                  1   597.74 6978.8
## + diff_MG                  1   598.47 6979.5
## + diff_RJ                  1   602.59 6983.6
## + purchase_timestamp_week  6   601.16 6992.2
## + diff_early_morning       1   622.43 7003.5
## <none>                         644.84 7023.9
## 
## Step:  AIC=6335.26
## orders ~ lag1_night
## 
##                           Df Deviance    AIC
## + lag1_afternoon           1   578.94 6246.9
## + lag1_morning             1   627.25 6295.2
## + lag1_early_morning       1   642.73 6310.7
## + diff_night               1   652.36 6320.4
## + diff_morning             1   658.43 6326.4
## + diff_RJ                  1   659.36 6327.4
## + diff_SP                  1   660.71 6328.7
## + purchase_timestamp_week  6   651.05 6329.0
## + diff_orders              1   662.42 6330.4
## + diff_MG                  1   663.10 6331.1
## + diff_others              1   664.14 6332.1
## <none>                         669.26 6335.3
## + diff_early_morning       1   667.44 6335.4
## + diff_afternoon           1   668.61 6336.6
## - lag1_night               1  1992.21 7656.2
## 
## Step:  AIC=6238.85
## orders ~ lag1_night + lag1_afternoon
## 
##                           Df Deviance    AIC
## + purchase_timestamp_week  6   601.17 6171.5
## + diff_night               1   672.40 6232.8
## <none>                         680.48 6238.8
## + lag1_early_morning       1   678.48 6238.9
## + diff_morning             1   678.92 6239.3
## + diff_RJ                  1   678.95 6239.3
## + diff_MG                  1   679.45 6239.8
## + diff_afternoon           1   679.78 6240.1
## + diff_others              1   679.86 6240.2
## + diff_SP                  1   680.03 6240.4
## + lag1_morning             1   680.13 6240.5
## + diff_orders              1   680.32 6240.7
## + diff_early_morning       1   680.37 6240.7
## - lag1_afternoon           1   788.05 6344.4
## - lag1_night               1   879.16 6435.5
## 
## Step:  AIC=6164.73
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week
## 
##                           Df Deviance    AIC
## + lag1_morning             1   684.14 6154.1
## + diff_night               1   691.74 6161.7
## + lag1_early_morning       1   691.79 6161.8
## <none>                         696.72 6164.7
## + diff_morning             1   695.47 6165.5
## + diff_others              1   696.01 6166.0
## + diff_MG                  1   696.23 6166.2
## + diff_RJ                  1   696.29 6166.3
## + diff_afternoon           1   696.37 6166.4
## + diff_SP                  1   696.44 6166.4
## + diff_orders              1   696.46 6166.5
## + diff_early_morning       1   696.51 6166.5
## - purchase_timestamp_week  6   790.59 6246.6
## - lag1_night               1   813.05 6279.1
## - lag1_afternoon           1   891.42 6357.4
## 
## Step:  AIC=6153.98
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week + 
##     lag1_morning
## 
##                           Df Deviance    AIC
## + diff_night               1   694.47 6150.6
## + lag1_early_morning       1   696.85 6153.0
## <none>                         699.87 6154.0
## + diff_afternoon           1   699.02 6155.1
## + diff_morning             1   699.25 6155.4
## + diff_others              1   699.33 6155.4
## + diff_RJ                  1   699.66 6155.8
## + diff_MG                  1   699.68 6155.8
## + diff_SP                  1   699.75 6155.9
## + diff_orders              1   699.81 6155.9
## + diff_early_morning       1   699.81 6155.9
## - lag1_morning             1   712.79 6164.9
## - lag1_afternoon           1   767.35 6219.5
## - lag1_night               1   797.75 6249.9
## - purchase_timestamp_week  6   808.73 6250.8
## 
## Step:  AIC=6150.55
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week + 
##     lag1_morning + diff_night
## 
##                           Df Deviance    AIC
## + lag1_early_morning       1   696.88 6149.1
## <none>                         700.29 6150.5
## + diff_afternoon           1   699.48 6151.7
## + diff_morning             1   699.84 6152.1
## + diff_orders              1   700.07 6152.3
## + diff_others              1   700.09 6152.3
## + diff_RJ                  1   700.25 6152.5
## + diff_MG                  1   700.25 6152.5
## + diff_SP                  1   700.25 6152.5
## + diff_early_morning       1   700.28 6152.5
## - diff_night               1   705.75 6154.0
## - lag1_morning             1   713.66 6161.9
## - lag1_afternoon           1   761.20 6209.5
## - lag1_night               1   795.45 6243.7
## - purchase_timestamp_week  6   806.13 6244.4
## 
## Step:  AIC=6149.13
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week + 
##     lag1_morning + diff_night + lag1_early_morning
## 
##                           Df Deviance    AIC
## <none>                         700.83 6149.1
## + diff_afternoon           1   699.98 6150.3
## + diff_morning             1   700.14 6150.4
## - lag1_early_morning       1   704.26 6150.6
## + diff_others              1   700.38 6150.7
## + diff_MG                  1   700.72 6151.0
## + diff_orders              1   700.73 6151.0
## + diff_RJ                  1   700.78 6151.1
## + diff_SP                  1   700.80 6151.1
## + diff_early_morning       1   700.81 6151.1
## - diff_night               1   706.67 6153.0
## - lag1_morning             1   712.08 6158.4
## - lag1_afternoon           1   755.00 6201.3
## - lag1_night               1   790.83 6237.1
## - purchase_timestamp_week  6   807.41 6243.7
summary(stepwise_model)
## 
## Call:
## glm.nb(formula = orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week + 
##     lag1_morning + diff_night + lag1_early_morning, data = period_state_orders2, 
##     init.theta = 16.2117696, link = log)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.0908933  0.0392028 104.352  < 2e-16 ***
## lag1_night                0.0093107  0.0009775   9.525  < 2e-16 ***
## lag1_afternoon            0.0071557  0.0009763   7.329 2.32e-13 ***
## purchase_timestamp_week2 -0.2847990  0.0429577  -6.630 3.36e-11 ***
## purchase_timestamp_week3 -0.2967043  0.0431882  -6.870 6.42e-12 ***
## purchase_timestamp_week4 -0.2817291  0.0445871  -6.319 2.64e-10 ***
## purchase_timestamp_week5 -0.3382511  0.0451792  -7.487 7.05e-14 ***
## purchase_timestamp_week6 -0.4198728  0.0469256  -8.948  < 2e-16 ***
## purchase_timestamp_week7 -0.1329664  0.0426327  -3.119 0.001815 ** 
## lag1_morning              0.0037271  0.0011176   3.335 0.000853 ***
## diff_night                0.0028294  0.0011702   2.418 0.015616 *  
## lag1_early_morning        0.0059207  0.0031563   1.876 0.060678 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(16.2118) family taken to be 1)
## 
##     Null deviance: 2911.61  on 605  degrees of freedom
## Residual deviance:  700.83  on 594  degrees of freedom
## AIC: 6151.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  16.21 
##           Std. Err.:  1.17 
## 
##  2 x log-likelihood:  -6125.127
AIC(forward_model)
## [1] 6151.127
AIC(backward_model)
## [1] 925721.3
AIC(stepwise_model)
## [1] 6151.127
# Compare the models' BIC values
BIC(forward_model)
## [1] 6208.416
BIC(backward_model)
## [1] 925813.9
BIC(stepwise_model)
## [1] 6208.416

After comparing the AIC and BIC values, we adopt the formula selected by stepwise_model:
orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week + lag1_morning + diff_night + lag1_early_morning
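As a side note, AIC/BIC comparisons like the one above can be tabulated in a single data frame. A self-contained toy sketch with two nested `lm` fits (the data and model names here are illustrative, not from the Olist data):

```r
# Toy illustration: tabulate AIC/BIC for candidate models,
# where x2 is a pure-noise predictor
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 2 + 0.5 * d$x1 + rnorm(100)

m1 <- lm(y ~ x1, data = d)        # true model
m2 <- lm(y ~ x1 + x2, data = d)   # adds a useless predictor

comparison <- data.frame(
  model = c("m1", "m2"),
  AIC   = c(AIC(m1), AIC(m2)),
  BIC   = c(BIC(m1), BIC(m2))
)
comparison[order(comparison$AIC), ]  # lowest AIC first
```

BIC penalizes the extra parameter by log(n) rather than 2, so it tends to favor the smaller model more strongly, which is why both criteria are worth checking.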

5. Building Model 4

Model 4

\(y\) ~ \(week + lag1\_morning + lag1\_afternoon + lag1\_night + lag1\_early\_morning + diff\_night\)

negbin_model4 <- glm.nb(orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week + 
    lag1_morning + diff_night + lag1_early_morning,
                        data = period_state_orders2)
par(mfrow = c(2, 2))
summary(negbin_model4)
## 
## Call:
## glm.nb(formula = orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week + 
##     lag1_morning + diff_night + lag1_early_morning, data = period_state_orders2, 
##     init.theta = 16.2117696, link = log)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.0908933  0.0392028 104.352  < 2e-16 ***
## lag1_night                0.0093107  0.0009775   9.525  < 2e-16 ***
## lag1_afternoon            0.0071557  0.0009763   7.329 2.32e-13 ***
## purchase_timestamp_week2 -0.2847990  0.0429577  -6.630 3.36e-11 ***
## purchase_timestamp_week3 -0.2967043  0.0431882  -6.870 6.42e-12 ***
## purchase_timestamp_week4 -0.2817291  0.0445871  -6.319 2.64e-10 ***
## purchase_timestamp_week5 -0.3382511  0.0451792  -7.487 7.05e-14 ***
## purchase_timestamp_week6 -0.4198728  0.0469256  -8.948  < 2e-16 ***
## purchase_timestamp_week7 -0.1329664  0.0426327  -3.119 0.001815 ** 
## lag1_morning              0.0037271  0.0011176   3.335 0.000853 ***
## diff_night                0.0028294  0.0011702   2.418 0.015616 *  
## lag1_early_morning        0.0059207  0.0031563   1.876 0.060678 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(16.2118) family taken to be 1)
## 
##     Null deviance: 2911.61  on 605  degrees of freedom
## Residual deviance:  700.83  on 594  degrees of freedom
## AIC: 6151.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  16.21 
##           Std. Err.:  1.17 
## 
##  2 x log-likelihood:  -6125.127
plot(negbin_model4)

Residual plots:

  1. Residuals vs Fitted: the residuals are scattered around 0 but bow upward, suggesting a remaining trend.

  2. Q-Q plot: the right tail deviates from the line, so the residuals are not fully normal and some outliers sit in the right tail.

  3. Scale-Location plot: the curve bows downward, again suggesting some trend.

  4. Residuals vs Leverage plot: no isolated high-leverage points or outliers.

Conclusions:

  1. Model fit: some trend remains.

  2. Residual distribution: the right tail deviates, indicating outliers or extreme values.

  3. Scale-Location: the residual variance is not constant, i.e. heteroscedasticity is present.

  4. Outliers and leverage: no high-leverage points.

Overall the model's explanatory power improves, but some structure remains in the residuals.

Validation

set.seed(123)
train_index <- createDataPartition(period_state_orders2$orders, p = 0.7, list = FALSE)
train_data <- period_state_orders2[train_index, ]
test_data <- period_state_orders2[-train_index, ]
# Fit the model on train_data
model <- glm.nb(formula = orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week + 
    lag1_morning + diff_night + lag1_early_morning,
                data = train_data, link = log)
summary(model)
## 
## Call:
## glm.nb(formula = orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week + 
##     lag1_morning + diff_night + lag1_early_morning, data = train_data, 
##     link = log, init.theta = 14.76368879)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.033004   0.048960  82.374  < 2e-16 ***
## lag1_night                0.010284   0.001236   8.323  < 2e-16 ***
## lag1_afternoon            0.007684   0.001247   6.161 7.21e-10 ***
## purchase_timestamp_week2 -0.242079   0.053033  -4.565 5.00e-06 ***
## purchase_timestamp_week3 -0.257723   0.054269  -4.749 2.04e-06 ***
## purchase_timestamp_week4 -0.245121   0.056131  -4.367 1.26e-05 ***
## purchase_timestamp_week5 -0.312144   0.055484  -5.626 1.85e-08 ***
## purchase_timestamp_week6 -0.349786   0.060584  -5.774 7.76e-09 ***
## purchase_timestamp_week7 -0.086067   0.051774  -1.662   0.0964 .  
## lag1_morning              0.002721   0.001487   1.830   0.0673 .  
## diff_night                0.002468   0.001525   1.619   0.1055    
## lag1_early_morning        0.005518   0.003925   1.406   0.1598    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(14.7637) family taken to be 1)
## 
##     Null deviance: 1911.26  on 425  degrees of freedom
## Residual deviance:  496.53  on 414  degrees of freedom
## AIC: 4368.4
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  14.76 
##           Std. Err.:  1.27 
## 
##  2 x log-likelihood:  -4342.396
predictions <- predict(model, test_data, type = "response")

# Evaluate predictive accuracy
mse <- mean((test_data$orders - predictions)^2)
rmse <- sqrt(mse)
aic <- AIC(model)

print(paste("MSE: ", mse))
## [1] "MSE:  2548.18776396762"
print(paste("RMSE: ", rmse))
## [1] "RMSE:  50.4795776920491"
print(paste("AIC: ", aic))
## [1] "AIC:  4368.39560058943"
test_data$predicted_orders <- predict(model, newdata = test_data, type = "response")
ggplot(test_data, aes(x = orders, y = predicted_orders)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Actual vs. Predicted Orders",
       x = "Actual Orders",
       y = "Predicted Orders") +
  theme_minimal()

Cross-validation: k = 10

# Train the model with k-fold cross-validation
k <- 10
set.seed(123)
train_control <- trainControl(method = "cv", number = k)
model <- train(orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week + 
    lag1_morning + diff_night + lag1_early_morning,
               data = period_state_orders2,
               method = "glm.nb",
               trControl = train_control)

print(model)
## Negative Binomial Generalized Linear Model 
## 
## 606 samples
##   6 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 546, 546, 546, 545, 546, 544, ... 
## Resampling results across tuning parameters:
## 
##   link      RMSE      Rsquared   MAE     
##   identity       NaN        NaN       NaN
##   log       42.21194  0.8080095  28.01818
##   sqrt      32.58018  0.8633834  23.63638
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was link = sqrt.

Summary

  1. The fit improves considerably and the slope steepens, but the error grows for larger predicted values.

  2. After cross-validation, the average error drops sharply, meaning the gap between predicted and actual values shrinks substantially, and the explained variance rises to about 80%.

Metrics:

Mean Absolute Error (MAE)

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| = 28.01818 \]

Root Mean Squared Error (RMSE)

\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = 42.21194 \]

Coefficient of Determination (R²)

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 0.8080095 = 80.8\% \]
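The three metrics above can be computed directly from vectors of actual and predicted values. A minimal helper, assuming `actual` and `predicted` are numeric vectors of equal length (the function name and toy data are illustrative):

```r
# Compute MAE, RMSE, and R-squared for a set of predictions
eval_metrics <- function(actual, predicted) {
  resid <- actual - predicted
  mae  <- mean(abs(resid))                                  # mean absolute error
  rmse <- sqrt(mean(resid^2))                               # root mean squared error
  r2   <- 1 - sum(resid^2) / sum((actual - mean(actual))^2) # coefficient of determination
  c(MAE = mae, RMSE = rmse, R2 = r2)
}

# Toy example
eval_metrics(actual = c(10, 20, 30), predicted = c(12, 18, 33))
# MAE 2.333..., RMSE 2.380..., R2 0.915
```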

6. Building Model 5

Model 5

\(y\) ~ \(week + lag1\_morning + \sqrt{\text{lag1_SP}} + \sqrt{\text{lag1_RJ}} + \sqrt{\text{lag1_others}}\)

negbin_model5 <- glm.nb(orders~  purchase_timestamp_week+
                          lag1_morning +
                          sqrt(lag1_SP)+ sqrt(lag1_RJ)+ sqrt(lag1_others),
                        data = period_state_orders2)
summary(negbin_model5)
## 
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week + lag1_morning + 
##     sqrt(lag1_SP) + sqrt(lag1_RJ) + sqrt(lag1_others), data = period_state_orders2, 
##     init.theta = 34.71383173, link = log)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               2.8420912  0.0556051  51.112  < 2e-16 ***
## purchase_timestamp_week2 -0.2490624  0.0306572  -8.124 4.51e-16 ***
## purchase_timestamp_week3 -0.2651266  0.0307252  -8.629  < 2e-16 ***
## purchase_timestamp_week4 -0.2552279  0.0313604  -8.139 4.00e-16 ***
## purchase_timestamp_week5 -0.3222615  0.0313969 -10.264  < 2e-16 ***
## purchase_timestamp_week6 -0.4142629  0.0318938 -12.989  < 2e-16 ***
## purchase_timestamp_week7 -0.1468184  0.0298879  -4.912 9.00e-07 ***
## lag1_morning             -0.0073819  0.0009091  -8.120 4.66e-16 ***
## sqrt(lag1_SP)             0.1416348  0.0090050  15.728  < 2e-16 ***
## sqrt(lag1_RJ)             0.1021541  0.0130552   7.825 5.09e-15 ***
## sqrt(lag1_others)         0.1566234  0.0101773  15.390  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(34.7138) family taken to be 1)
## 
##     Null deviance: 5383.86  on 605  degrees of freedom
## Residual deviance:  729.36  on 595  degrees of freedom
## AIC: 5793.3
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  34.71 
##           Std. Err.:  2.84 
## 
##  2 x log-likelihood:  -5769.275

The table shows that every predictor is statistically significant.

par(mfrow = c(2, 2))
plot(negbin_model5)

Residual plots:

  1. Residuals vs Fitted: the residuals are scattered around 0, with a weaker trend than in model 4.

  2. Q-Q plot: the right tail deviates from the line, so the residuals are not fully normal and some outliers sit in the right tail.

  3. Scale-Location plot: values are scattered around 0, with a weaker trend than in model 4.

  4. Residuals vs Leverage plot: no isolated high-leverage points or outliers.

Conclusions:

  1. Model fit: only a slight trend remains.

  2. Residual distribution: the right tail deviates, indicating outliers or extreme values.

  3. Scale-Location: the residual variance differs slightly, i.e. mild heteroscedasticity.

  4. Outliers and leverage: no high-leverage points.

Overall the model's explanatory power improves further, though some structure remains in the residuals.

Validation

set.seed(123)
train_index <- createDataPartition(period_state_orders2$orders, p = 0.7, list = FALSE)
train_data <- period_state_orders2[train_index, ]
test_data <- period_state_orders2[-train_index, ]
# Fit the model on train_data
model <- glm.nb(formula = orders ~ purchase_timestamp_week+
                  lag1_morning +
                  sqrt(lag1_SP)+ sqrt(lag1_RJ)+ sqrt(lag1_others),
                data = train_data, link = log)
summary(model)
## 
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week + lag1_morning + 
##     sqrt(lag1_SP) + sqrt(lag1_RJ) + sqrt(lag1_others), data = train_data, 
##     link = log, init.theta = 32.84952412)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               2.695129   0.069360  38.857  < 2e-16 ***
## purchase_timestamp_week2 -0.204346   0.037082  -5.511 3.58e-08 ***
## purchase_timestamp_week3 -0.237659   0.037707  -6.303 2.92e-10 ***
## purchase_timestamp_week4 -0.208892   0.038622  -5.409 6.35e-08 ***
## purchase_timestamp_week5 -0.288544   0.037982  -7.597 3.03e-14 ***
## purchase_timestamp_week6 -0.360307   0.040071  -8.992  < 2e-16 ***
## purchase_timestamp_week7 -0.100134   0.035257  -2.840  0.00451 ** 
## lag1_morning             -0.009504   0.001162  -8.176 2.93e-16 ***
## sqrt(lag1_SP)             0.159265   0.010776  14.780  < 2e-16 ***
## sqrt(lag1_RJ)             0.116784   0.015995   7.301 2.85e-13 ***
## sqrt(lag1_others)         0.156663   0.012092  12.956  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(32.8495) family taken to be 1)
## 
##     Null deviance: 3658.30  on 425  degrees of freedom
## Residual deviance:  521.75  on 415  degrees of freedom
## AIC: 4104.8
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  32.85 
##           Std. Err.:  3.20 
## 
##  2 x log-likelihood:  -4080.775
predictions <- predict(model, test_data, type = "response")

# Evaluate predictive accuracy
mse <- mean((test_data$orders - predictions)^2)
rmse <- sqrt(mse)
aic <- AIC(model)

print(paste("MSE: ", mse))
## [1] "MSE:  930.759732232672"
print(paste("RMSE: ", rmse))
## [1] "RMSE:  30.508355121715"
print(paste("AIC: ", aic))
## [1] "AIC:  4104.77492230407"
test_data$predicted_orders <- predict(model, newdata = test_data, type = "response")
ggplot(test_data, aes(x = orders, y = predicted_orders)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Actual vs. Predicted Orders",
       x = "Actual Orders",
       y = "Predicted Orders") +
  theme_minimal()

Cross-validation: k = 10

# Train the model with k-fold cross-validation
k <- 10
set.seed(123)
train_control <- trainControl(method = "cv", number = k)
model <- train(orders ~ purchase_timestamp_week+
                  lag1_morning +
                  sqrt(lag1_SP)+ sqrt(lag1_RJ)+ sqrt(lag1_others),
               data = period_state_orders2,
               method = "glm.nb",
               trControl = train_control)

print(model)
## Negative Binomial Generalized Linear Model 
## 
## 606 samples
##   5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 546, 546, 546, 545, 546, 544, ... 
## Resampling results across tuning parameters:
## 
##   link      RMSE      Rsquared   MAE     
##   identity       NaN        NaN       NaN
##   log       27.98589  0.8860251  20.97589
##   sqrt      26.42287  0.8903867  20.21658
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was link = sqrt.

Summary

  1. The fit improves further: part of the residual trend is removed, and the error for larger predicted values shrinks.

  2. After cross-validation, the average error drops sharply, meaning the gap between predicted and actual values shrinks substantially, and the explained variance rises to about 88%.

Metrics:

Mean Absolute Error (MAE)

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| = 20.97589 \]

Root Mean Squared Error (RMSE)

\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = 27.98589 \]

Coefficient of Determination (R²)

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 0.8860251 = 88.6\% \]

Conclusions

  1. Order volume is significantly affected by the day of the week and by the previous day's morning orders, SP-state orders, RJ-state orders, and other-state orders.
  2. The model \(y\) ~ \(week + lag1\_morning + \sqrt{\text{lag1_SP}} + \sqrt{\text{lag1_RJ}} + \sqrt{\text{lag1_others}}\) explains roughly 88% of the variance in order volume through these predictors.

k-means clustering

Taking midpoints

Removing overseas points (outliers)

# Remove the overseas outlier points
a<- customer_sellers_zip1[which(customer_sellers_zip1$customer_lng>-41 & customer_sellers_zip1$customer_lat>-2),]
a_sf <- st_as_sf(a, coords = c("customer_lng", "customer_lat"), crs = 4326)
# Map the removed outlier points
ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = a_sf, size = 2,color = "red") +
  theme_minimal() +
  labs(color = "Location Type")

customer_sellers_zip1 <- customer_sellers_zip1[-which(customer_sellers_zip1$customer_lng>-41 & customer_sellers_zip1$customer_lat>-2),]
library(geosphere)
library(maps)               # world map data
library(rnaturalearth)
library(rnaturalearthhires) # ne_states()
library(sf)                 # simple features
brazil <- ne_countries(country = "Brazil", returnclass = "sf")
brazil_states <- ne_states(country = "Brazil", returnclass = "sf")

customer_geo <- data.frame(
  customer_id = customer_sellers_zip1$customer_id,
  customer_lat = customer_sellers_zip1$customer_lat,
  customer_lng = customer_sellers_zip1$customer_lng
)
customer_geo <- customer_geo[1:3, ]  # the full ~100k rows take too long to plot; keep only the first 3 rows as a demo

sellers_geo <- data.frame(
  seller_id = customer_sellers_zip1$seller_id,
  seller_lat = customer_sellers_zip1$seller_lat,
  seller_lng = customer_sellers_zip1$seller_lng
)
sellers_geo <-sellers_geo[1:3,]


a <- data.frame(
  customer_lat = customer_sellers_zip1$customer_lat,
  seller_lat = customer_sellers_zip1$seller_lat
  )
b <- data.frame(
  customer_lng = customer_sellers_zip1$customer_lng,
  seller_lng = customer_sellers_zip1$seller_lng
)

customer_sellers_meanzip<- customer_sellers_zip1 %>% mutate(
  mean_lat = apply(a , 1, mean),
  mean_lng = apply(b , 1, mean)
)
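A note on the midpoint computation: `rowMeans` gives the same per-row averages as the `apply(..., 1, mean)` calls above and is vectorized, which matters at ~100k rows. A toy sketch with hypothetical coordinates:

```r
# Vectorized midpoint of customer and seller coordinates via rowMeans
pts <- data.frame(customer_lat = c(-23.5, -22.9),
                  seller_lat   = c(-20.1, -19.9),
                  customer_lng = c(-46.6, -43.2),
                  seller_lng   = c(-44.0, -43.9))

pts$mean_lat <- rowMeans(pts[, c("customer_lat", "seller_lat")])
pts$mean_lng <- rowMeans(pts[, c("customer_lng", "seller_lng")])
pts$mean_lat  # -21.8 -21.4
```

An arithmetic mean of latitude/longitude is a reasonable approximation within Brazil; for long routes a great-circle midpoint (e.g. `geosphere::midPoint`) would be more accurate.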

mean_geo <- data.frame(
  mean_lat = customer_sellers_meanzip$mean_lat,
  mean_lng = customer_sellers_meanzip$mean_lng
)
mean_geo <-mean_geo[1:3,]
# Convert the points to sf objects
customer_geo_sf <- st_as_sf(customer_geo, coords = c("customer_lng", "customer_lat"), crs = 4326)
sellers_geo_sf <- st_as_sf(sellers_geo, coords = c("seller_lng", "seller_lat"), crs = 4326)
mean_geo_sf <- st_as_sf(mean_geo, coords = c("mean_lng", "mean_lat"), crs = 4326)
# Compute a great-circle route for each customer-seller pair
routes <- data.frame()
for (i in 1:nrow(customer_geo)) {
  route <- gcIntermediate(c(customer_geo$customer_lng[i], customer_geo$customer_lat[i]),
                          c(sellers_geo$seller_lng[i], sellers_geo$seller_lat[i]),
                          n = 50, addStartEnd = TRUE, sp = TRUE)
  route <- st_as_sf(route)
  route$customer_id <- customer_geo$customer_id[i]
  routes <- rbind(routes, route)
}
ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = customer_geo_sf, aes(color = "customers"), size = 2) +
  geom_sf(data = sellers_geo_sf , aes(color = "Sellers"), size = 2) +
  geom_sf(data = routes, color = "blue") +
  scale_color_manual(values = c("customers" = "red", "Sellers" = "green")) +
  theme_minimal() +
  labs(title = "The distance between the customers and the sellers", 
       x = "Longitude", y = "Latitude",
       color = "Location Type")   

# Map with the midpoints added
ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = customer_geo_sf, aes(color = "customers"), size = 2) +
  geom_sf(data = sellers_geo_sf , aes(color = "Sellers"), size = 2) +
  geom_sf(data = mean_geo_sf , aes(color = "Midpoint"), size = 2) +
  geom_sf(data = routes, color = "blue") +
  scale_color_manual(values = c("customers" = "red", "Sellers" = "green","Midpoint"="orange")) +
  theme_minimal() +
  labs(title = "The distance between the customers and the sellers", 
       x = "Longitude", y = "Latitude",
       color = "Location Type")   

# Plot the midpoint locations
mean_geo <- data.frame(
  mean_lat = customer_sellers_meanzip$mean_lat,
  mean_lng = customer_sellers_meanzip$mean_lng
)

mean_geo_sf <- st_as_sf(mean_geo, coords = c("mean_lng", "mean_lat"), crs = 4326)

ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = mean_geo_sf , aes(color = "Midpoint"), size = 1) +
  scale_color_manual(values = c("Midpoint"="orange")) +
  theme_minimal() +
  labs(title = "Midpoint of position", 
       x = "Longitude", y = "Latitude",
       color = "Location Type")   

kmeans k=1

# kmeans k=1
k <- 1
kmeans_result1 <- kmeans(mean_geo[,1:2], centers=k)
mean_geo$cluster <- as.factor(kmeans_result1$cluster)
mean_geo_sf1 <- st_as_sf(mean_geo, coords = c("mean_lng", "mean_lat"), crs = 4326)
centers_df1 <- as.data.frame(kmeans_result1$centers)
centers_df1$cluster <- factor(1:nrow(centers_df1))  # add cluster labels
unique(centers_df1)
##    mean_lat mean_lng cluster
## 1 -22.02344 -46.7283       1
#k=1
ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = mean_geo_sf1, aes(color = cluster), size = 1, color = "orange") + 
  geom_point(data = as.data.frame(kmeans_result1$centers), aes(x = mean_lng, y = mean_lat),
             color = 'red', size = 5, shape = 4) +
  geom_text(data = centers_df1,
            aes(x = mean_lng, y = mean_lat,
                label = paste("(",round(mean_lng, 2),",", round(mean_lat, 2), ")", sep = ""))) +
  labs(title = paste("K-means cluster (K=", k, ")", sep=""), 
       x = "Longitude", y = "Latitude") +  
  labs(color = "cluster")+
  theme_minimal()

kmeans k=3

k <- 3
set.seed(12)
kmeans_result3 <- kmeans(mean_geo[,1:2], centers=k)
mean_geo$cluster <- as.factor(kmeans_result3$cluster)
mean_geo_sf3 <- st_as_sf(mean_geo, coords = c("mean_lng", "mean_lat"), crs = 4326)
centers_df3 <- as.data.frame(kmeans_result3$centers)
centers_df3$cluster <- factor(1:nrow(centers_df3))  # add cluster labels
unique(centers_df3)
##    mean_lat  mean_lng cluster
## 1 -23.68930 -48.09046       1
## 2 -21.60400 -45.52332       2
## 3 -14.90165 -43.49890       3
ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +  
  geom_sf(data = mean_geo_sf3, aes(color = cluster), size = 1) + 
  geom_point(data = as.data.frame(kmeans_result3$centers), aes(x = mean_lng, y = mean_lat), 
             color = 'red', size = 5, shape = 4) + 
  geom_text(data = centers_df3, 
            aes(x = mean_lng, y = mean_lat,label = paste("(",round(mean_lng, 2),",", round(mean_lat, 2), ")", sep = "")), 
            vjust = -1, hjust = 0.5, color = "black", size = 3) + 
  labs(title = paste("K-means cluster (K=", k, ")", sep=""), x = "Longitude", y = "Latitude") + 
  theme_minimal() 

Choosing K

1. Elbow Method

Select a suitable number of clusters by inspecting the WCSS (Within-Cluster Sum of Squares):
\[ W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2, \quad \mu_k = E(x_i) \text{ where } x_i \in C_k \]

\[ \min (\text{tot.withinss}) = \min \left( \sum_{k=1}^{K} W(C_k) \right) = \min \left( \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)^2 \right) \]

Idea:
\(\qquad\) Plot the number of clusters against the total within-cluster sum of squares (WCSS) and observe how the WCSS changes as the number of clusters grows.
\(\qquad\) \(\text{tot.withinss}\) measures how tight the clusters are; we want it as small as possible.

Procedure:
\(\qquad\) Step 1: Run the clustering algorithm (e.g. k-means) for different values of k, for instance varying k from 1 to 10 clusters.

\(\qquad\) Step 2: For each k, compute the total within-cluster sum of squares (WCSS).

\(\qquad\) Step 3: Plot the WCSS curve against the number of clusters k.

\(\qquad\) Step 4: The location of the bend (elbow) in the curve is usually taken as an indicator of the appropriate number of clusters.

# Use the Elbow Method to pick a suitable K
wss <- (nrow(mean_geo)-1)*sum(apply(mean_geo[,1:2], 2, var))  # total SS, i.e. WCSS for k = 1
for (i in 2:30) {
  kmeans_result <- kmeans(mean_geo[, 1:2], centers = i, nstart = 50, iter.max = 100)
  wss[i] <- sum(kmeans_result$withinss)
}

k <- 30
wss_df <- tibble(clusters = 1:k, wss = wss)
scree_plot <- ggplot(wss_df, aes(x = clusters, y = wss, group = 1)) +
  geom_point(size = 3)+
  geom_line() +
  scale_x_continuous(breaks = c(1:30)) +
  xlab("Number of clusters")+
  ylab("Within-Cluster Sum of Squares")+
  labs(title = "Within-Cluster Sum of Squares under different K values")
scree_plot +geom_hline(
  yintercept = wss,
  linetype = 'dashed',
  col = c(rep('#000000',4),'#FF0000', rep('#000000', 25))
  )

kmeans_result <- kmeans(mean_geo[,1:2], centers=k)

The K at which the WCSS stops dropping sharply as k increases is taken as the optimal number of clusters.

As the plot shows, after k = 5 the WCSS declines much more slowly, which makes it a reasonable candidate, so we choose K = 5.

2. Silhouette method

Average Silhouette Method (Silhouette Coefficient)

Evaluates cluster quality.

Idea:
\(\qquad\) 1. Measure how well each observation sits inside its own cluster. A higher Silhouette Coefficient indicates better clustering.

\(\qquad\) 2. The Average Silhouette Method computes the mean silhouette of the observations for different values of k.

\(\qquad\) 3. The optimal number of clusters k is the one that maximizes the average silhouette over a range of candidate k values.

\(\qquad\) 4. The Silhouette Coefficient \(s(i)\) is defined as:
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} , \quad s(i) \in [-1, 1] \]

\(\qquad\) \(a(i)\) = the average distance from observation i to the other points in its own cluster (the intra-cluster distance)

\(\qquad\) \(b(i)\) = the average distance from observation i to the points of the nearest other cluster (the inter-cluster distance)

\(\qquad\) \(s(i)\) = 1 (the points within the cluster are very similar)

\(\qquad\) \(s(i)\) = 0 (the observation lies exactly on the boundary between two clusters)

\(\qquad\) \(s(i)\) = -1 (the clustering is poor and the observation is probably misassigned)

Procedure:
\(\qquad\) Step 1: Run the clustering algorithm (e.g. k-means) for different values of k, for instance varying k from 1 to 10 clusters.

\(\qquad\) Step 2: For each k, compute the average silhouette of the observations (average \(s(i)\)).

\(\qquad\) Step 3: Plot the average silhouette curve against the number of clusters k.

\(\qquad\) Step 4: The location of the maximum is taken as the appropriate number of clusters.
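The steps above can be sketched with the `cluster` package's `silhouette()`. Everything in this snippet (the synthetic points, the k range) is illustrative rather than part of the original analysis:

```r
library(cluster)  # silhouette()

set.seed(1)
# Two well-separated toy clusters in 2-D
xy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
            matrix(rnorm(100, mean = 5), ncol = 2))

# Average silhouette width for each candidate k
avg_sil <- sapply(2:6, function(k) {
  km  <- kmeans(xy, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dist(xy))
  mean(sil[, "sil_width"])        # average s(i) for this k
})
names(avg_sil) <- 2:6
names(which.max(avg_sil))         # the k maximizing the average silhouette
```

With two clearly separated clusters the average silhouette peaks at k = 2, matching the construction of the toy data.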

3. Gap statistic

Determines the optimal number of clusters by comparing the clustering of the observed data with that of randomly generated reference data.

Idea:
\(\qquad\) 1. If the observed data form clearer clusters than randomly generated data, the total within-cluster sum of squares of the observed data (\(w_k\)) should be markedly lower than
\(\qquad\quad\) that of the randomly generated data. The larger the gap, the more distinct the clustering.

\(\qquad\) 2. The gap statistic compares the total within-cluster variation for different k values against its expected value under the reference data, assessing whether a given number of clusters beats randomly generated data:
\[\text{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log(w_{kb}^*) - \log(w_k)\]

\(\qquad\) Reference data generation: Monte Carlo simulation. For all \(x_i \in D\), where D is the data set:

\(\qquad\) Step 1: Compute the data range \([\min(x_i), \max(x_j)]\)

\(\qquad\) Step 2: Generate n observations from a uniform distribution: \(b_k\) ~ \(U(\min(x_i), \max(x_j)),\hspace{0.5em} k = 1, 2, ..., n\)

Procedure:
\(\qquad\) Step 1: Cluster the observed data for varying numbers of clusters \(k = 1, \dots, k_{max}\) and compute the within-cluster sum of squares \(w_k\)

\(\qquad\) Step 2: Generate B reference data sets from a random uniform distribution.

\(\qquad\) Step 3: Cluster each reference data set for \(k = 1, \dots, k_{max}\) and compute the within-cluster sums of squares \(w_{kb}^*\)

\(\qquad\) Step 4: Compute the gap statistic

\[\text{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log(w_{kb}^*) - \log(w_k)\]

\[E[\log(w_{kb}^*)] \text{: the average } \log \text{ total within-cluster sum of squares over the reference data sets,} \quad \log(w_k) \text{: the } \log \text{ total within-cluster sum of squares of the observed data}\]

\(\qquad\) Step 5: Compute the standard error \(s_k\) of the statistic. Let \(\bar{w} = \frac{1}{B} \sum_{b} \log(w_{kb}^*),\)

\[\text{sd}_{k}= \sqrt{\frac{1}{B} \sum_{b} (\log(w_{kb}^*) - \bar{w})^2} ,\quad s_k = \text{sd}_k \times \sqrt{1 + \frac{1}{B}}\]

\(\qquad\) Step 6: Choose the smallest number of clusters \(k\) such that the gap at \(k\) is within one standard error of the gap at \(k+1\): \[\text{Gap}(k) \geq \text{Gap}(k+1) - s_{k+1}\]
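These steps are implemented by `cluster::clusGap`. A minimal sketch on synthetic data (the toy points and the small `B = 50` are assumptions made for speed, not part of the original analysis; `B = 500` is more typical for a real analysis):

```r
library(cluster)  # clusGap(), maxSE()

set.seed(1)
# Two well-separated toy clusters in 2-D
xy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
            matrix(rnorm(60, mean = 5), ncol = 2))

# Gap statistic for k = 1..6, with B = 50 uniform reference data sets
gap <- clusGap(xy, FUN = kmeans, nstart = 25, K.max = 6, B = 50)

# Apply the one-standard-error rule from Step 6: Gap(k) >= Gap(k+1) - s_{k+1}
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
best_k
```

`gap$Tab` holds `logW`, `E.logW`, `gap`, and `SE.sim` per k, so the selection rule can also be checked by hand.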

Sample 1

##### Sample 1 #####
library(factoextra)
set.seed(121)  # for reproducibility
df <- mean_geo[, 1:2]
sampled_df <- df[sample(nrow(df), 10000), ]  # randomly sample 10,000 observations
sampled_df_sf <- st_as_sf(sampled_df, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = sampled_df_sf , aes(color = "Midpoint"), size = 1) +
  scale_color_manual(values = c("Midpoint"="orange")) +
  theme_minimal() +
  labs(title = "Sampling position",
       x = "Longitude", y = "Latitude",
       color = "Location Type")
# Elbow method
p2_30 <- fviz_nbclust(sampled_df, kmeans, method = "wss", k.max = 30) +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")
# Silhouette method
p3_30 <- fviz_nbclust(sampled_df, kmeans, method = "silhouette", k.max = 30)+
  labs(subtitle = "Silhouette method")
# Gap statistic
# nboot = 50 to keep the function speedy.
# recommended value for a final analysis: nboot = 500.
# Use verbose = FALSE to hide computing progression.
set.seed(121)
p4_30 <- fviz_nbclust(sampled_df, kmeans, nstart = 25,  method = "gap_stat", nboot = 50, k.max = 30)+
  labs(subtitle = "Gap statistic method")
p1

p2_30

p3_30

p4_30

Sample 2

##### Sample 2 #####
set.seed(122) # for reproducible results
df2 <- mean_geo[, 1:2]
sampled_df2 <- df2[sample(nrow(df2), 10000), ] # randomly sample 10,000 observations
sampled_df2_sf <- st_as_sf(sampled_df2, coords = c("mean_lng", "mean_lat"), crs = 4326)
p12 <- ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = sampled_df2_sf , aes(color = "Midpoint"), size = 1) +
  scale_color_manual(values = c("Midpoint"="orange")) +
  theme_minimal() +
  labs(title = "Sampling position",
       x = "Longitude", y = "Latitude",
       color = "Location Type")
# Elbow method
p22_30 <- fviz_nbclust(sampled_df2, kmeans, method = "wss", k.max = 30) +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")
# Silhouette method
p32_30 <- fviz_nbclust(sampled_df2, kmeans, method = "silhouette", k.max = 30)+
  labs(subtitle = "Silhouette method")
set.seed(122)
p42_30 <- fviz_nbclust(sampled_df2, kmeans, nstart = 25,  method = "gap_stat", nboot = 50, k.max = 30)+
  labs(subtitle = "Gap statistic method")
p12

p22_30

p32_30

p42_30

Sample 3

##### Sample 3 #####
set.seed(123) # for reproducible results
df3 <- mean_geo[, 1:2]
sampled_df3 <- df3[sample(nrow(df3), 10000), ] # randomly sample 10,000 observations
sampled_df3_sf <- st_as_sf(sampled_df3, coords = c("mean_lng", "mean_lat"), crs = 4326)
p13 <- ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = sampled_df3_sf , aes(color = "Midpoint"), size = 1) +
  scale_color_manual(values = c("Midpoint"="orange")) +
  theme_minimal() +
  labs(title = "Sampling position",
       x = "Longitude", y = "Latitude",
       color = "Location Type")
# Elbow method
p23_30 <- fviz_nbclust(sampled_df3, kmeans, method = "wss", k.max = 30) +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")
# Silhouette method
p33_30 <- fviz_nbclust(sampled_df3, kmeans, method = "silhouette", k.max = 30)+
  labs(subtitle = "Silhouette method")
# Gap statistic
set.seed(123)
p43_30 <- fviz_nbclust(sampled_df3, kmeans, nstart = 25,  method = "gap_stat", nboot = 50, k.max = 30)+
  labs(subtitle = "Gap statistic method")

p13

p23_30

p33_30

p43_30

Sample 4

##### Sample 4 #####
set.seed(124) # for reproducible results
df4 <- mean_geo[, 1:2]
sampled_df4 <- df4[sample(nrow(df4), 10000), ] # randomly sample 10,000 observations
sampled_df4_sf <- st_as_sf(sampled_df4, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = sampled_df4_sf , aes(color = "Midpoint"), size = 1) +
  scale_color_manual(values = c("Midpoint"="orange")) +
  theme_minimal() +
  labs(title = "Sampling position",
       x = "Longitude", y = "Latitude",
       color = "Location Type")
# Elbow method
p24_30 <- fviz_nbclust(sampled_df4, kmeans, method = "wss", k.max = 30) +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")
# Silhouette method
p34_30 <- fviz_nbclust(sampled_df4, kmeans, method = "silhouette", k.max = 30)+
  labs(subtitle = "Silhouette method")
set.seed(124)
p44_30 <- fviz_nbclust(sampled_df4, kmeans, nstart = 25,  method = "gap_stat", nboot = 50, k.max = 30)+
  labs(subtitle = "Gap statistic method")
p1

p24_30

p34_30

p44_30

Sample 5

##### Sample 5 #####
set.seed(125) # for reproducible results
df5 <- mean_geo[, 1:2]
sampled_df5 <- df5[sample(nrow(df5), 10000), ] # randomly sample 10,000 observations
sampled_df5_sf <- st_as_sf(sampled_df5, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = sampled_df5_sf , aes(color = "Midpoint"), size = 1) +
  scale_color_manual(values = c("Midpoint"="orange")) +
  theme_minimal() +
  labs(title = "Sampling position",
       x = "Longitude", y = "Latitude",
       color = "Location Type")
# Elbow method
p25_30 <- fviz_nbclust(sampled_df5, kmeans, method = "wss", k.max = 30) +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")
# Silhouette method
p35_30 <- fviz_nbclust(sampled_df5, kmeans, method = "silhouette", k.max = 30)+
  labs(subtitle = "Silhouette method")
set.seed(125)
p45_30 <- fviz_nbclust(sampled_df5, kmeans, nstart = 25,  method = "gap_stat", nboot = 50, k.max = 30)+
  labs(subtitle = "Gap statistic method")
p1

p25_30

p35_30

p45_30

Sample 6

##### Sample 6 #####
set.seed(126) # for reproducible results
df6 <- mean_geo[, 1:2]
sampled_df6 <- df6[sample(nrow(df6), 10000), ] # randomly sample 10,000 observations
sampled_df6_sf <- st_as_sf(sampled_df6, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = sampled_df6_sf , aes(color = "Midpoint"), size = 1) +
  scale_color_manual(values = c("Midpoint"="orange")) +
  theme_minimal() +
  labs(title = "Sampling position",
       x = "Longitude", y = "Latitude",
       color = "Location Type")
# Elbow method
p26_30 <- fviz_nbclust(sampled_df6, kmeans, method = "wss", k.max = 30) +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")
# Silhouette method
p36_30 <- fviz_nbclust(sampled_df6, kmeans, method = "silhouette", k.max = 30)+
  labs(subtitle = "Silhouette method")
set.seed(126)
p46_30 <- fviz_nbclust(sampled_df6, kmeans, nstart = 25,  method = "gap_stat", nboot = 50, k.max = 30)+
  labs(subtitle = "Gap statistic method")
p1

p26_30

p36_30

p46_30

Sample 7

##### Sample 7 #####
set.seed(127) # for reproducible results
df7 <- mean_geo[, 1:2]
sampled_df7 <- df7[sample(nrow(df7), 10000), ] # randomly sample 10,000 observations
sampled_df7_sf <- st_as_sf(sampled_df7, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = sampled_df7_sf , aes(color = "Midpoint"), size = 1) +
  scale_color_manual(values = c("Midpoint"="orange")) +
  theme_minimal() +
  labs(title = "Sampling position",
       x = "Longitude", y = "Latitude",
       color = "Location Type")
# Elbow method
p27_30 <- fviz_nbclust(sampled_df7, kmeans, method = "wss", k.max = 30) +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")
# Silhouette method
p37_30 <- fviz_nbclust(sampled_df7, kmeans, method = "silhouette", k.max = 30)+
  labs(subtitle = "Silhouette method")
set.seed(127)
p47_30 <- fviz_nbclust(sampled_df7, kmeans, nstart = 25,  method = "gap_stat", nboot = 50, k.max = 30)+
  labs(subtitle = "Gap statistic method")
p1

p27_30

p37_30

p47_30

Sample 8

##### Sample 8 #####
set.seed(128) # for reproducible results
df8 <- mean_geo[, 1:2]
sampled_df8 <- df8[sample(nrow(df8), 10000), ] # randomly sample 10,000 observations
sampled_df8_sf <- st_as_sf(sampled_df8, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = sampled_df8_sf , aes(color = "Midpoint"), size = 1) +
  scale_color_manual(values = c("Midpoint"="orange")) +
  theme_minimal() +
  labs(title = "Sampling position",
       x = "Longitude", y = "Latitude",
       color = "Location Type")
# Elbow method
p28_30 <- fviz_nbclust(sampled_df8, kmeans, method = "wss", k.max = 30) +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")
# Silhouette method
p38_30 <- fviz_nbclust(sampled_df8, kmeans, method = "silhouette", k.max = 30)+
  labs(subtitle = "Silhouette method")
set.seed(128)
p48_30 <- fviz_nbclust(sampled_df8, kmeans, nstart = 25,  method = "gap_stat", nboot = 50, k.max = 30)+
  labs(subtitle = "Gap statistic method")
p1

p28_30

p38_30

p48_30

Sample 9

##### Sample 9 #####
set.seed(129) # for reproducible results
df9 <- mean_geo[, 1:2]
sampled_df9 <- df9[sample(nrow(df9), 10000), ] # randomly sample 10,000 observations
sampled_df9_sf <- st_as_sf(sampled_df9, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = sampled_df9_sf , aes(color = "Midpoint"), size = 1) +
  scale_color_manual(values = c("Midpoint"="orange")) +
  theme_minimal() +
  labs(title = "Sampling position",
       x = "Longitude", y = "Latitude",
       color = "Location Type")
# Elbow method
p29_30 <- fviz_nbclust(sampled_df9, kmeans, method = "wss", k.max = 30) +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")
# Silhouette method
p39_30 <- fviz_nbclust(sampled_df9, kmeans, method = "silhouette", k.max = 30)+
  labs(subtitle = "Silhouette method")
set.seed(129)
p49_30 <- fviz_nbclust(sampled_df9, kmeans, nstart = 25,  method = "gap_stat", nboot = 50, k.max = 30)+
  labs(subtitle = "Gap statistic method")
p1

p29_30

p39_30

p49_30

Sample 10

##### Sample 10 #####
set.seed(1210) # for reproducible results
df10 <- mean_geo[, 1:2]
sampled_df10 <- df10[sample(nrow(df10), 10000), ] # randomly sample 10,000 observations
sampled_df10_sf <- st_as_sf(sampled_df10, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +
  geom_sf(data = sampled_df10_sf , aes(color = "Midpoint"), size = 1) +
  scale_color_manual(values = c("Midpoint"="orange")) +
  theme_minimal() +
  labs(title = "Sampling position",
       x = "Longitude", y = "Latitude",
       color = "Location Type")
# Elbow method
p210_30 <- fviz_nbclust(sampled_df10, kmeans, method = "wss", k.max = 30) +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")
# Silhouette method
p310_30 <- fviz_nbclust(sampled_df10, kmeans, method = "silhouette", k.max = 30)+
  labs(subtitle = "Silhouette method")
set.seed(1210)
p410_30 <- fviz_nbclust(sampled_df10, kmeans, nstart = 25,  method = "gap_stat", nboot = 50, k.max = 30)+
  labs(subtitle = "Gap statistic method")

p1

p210_30

p310_30

p410_30
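The ten blocks above differ only in their seed, so they could be factored into a single helper. A sketch (the function name `run_sample` is hypothetical), assuming `mean_geo`, `factoextra`, and the other libraries loaded earlier:

```r
# One sampling run: draw 10,000 midpoints and build the three diagnostics
run_sample <- function(seed, n = 10000, k.max = 30) {
  set.seed(seed)
  s <- mean_geo[sample(nrow(mean_geo), n), 1:2]
  list(
    elbow      = fviz_nbclust(s, kmeans, method = "wss", k.max = k.max),
    silhouette = fviz_nbclust(s, kmeans, method = "silhouette", k.max = k.max),
    gap        = fviz_nbclust(s, kmeans, nstart = 25, method = "gap_stat",
                              nboot = 50, k.max = k.max)
  )
}
runs <- lapply(c(121:129, 1210), run_sample)
```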

Gap statistic averaged over the 10 samples

The top 20 k values are marked with dashed lines shaded by rank

##### Average the gap statistic over the 10 samples #####
total_gap <- data.frame(p4_30$data$gap ,p42_30$data$gap,
    p43_30$data$gap, p44_30$data$gap,
    p45_30$data$gap, p46_30$data$gap,
    p47_30$data$gap, p48_30$data$gap,
    p49_30$data$gap, p410_30$data$gap)

total_gap <- total_gap %>% mutate(
  mean_gap = apply(.,1,mean),
  sd_gap = apply(.,1,sd),
  k = 1:nrow(total_gap)
)
top_n <- 20
top_mean_gap_indices <- order(total_gap$mean_gap, decreasing = TRUE)[1:top_n]
colors <- colorspace::sequential_hcl(n = top_n, palette = "Heat 2")

ggplot(total_gap, aes(x = k, y = mean_gap)) +
  geom_line() +
  geom_point() +
  lapply(seq_along(top_mean_gap_indices), function(i) {
    geom_vline(xintercept = total_gap$k[top_mean_gap_indices[i]],
               linetype = "dashed", color = colors[i], size = 1.1)
  }) +
  scale_x_continuous(breaks = c(1:30)) +
  labs(title = "Gap Statistic Method",
       subtitle = "Average Gap Statistics over 10 runs",
       x = "Number of clusters (K)",
       y = "Gap Statistic") +
  theme_minimal()

Conclusion

Based on the 10 random samples:

From the 100,000 records, 10,000 observations were randomly drawn for each set of diagnostics.

1. (Elbow method): WCSS decreases sharply up to k = 4, after which the decline levels off.

2. (Silhouette coefficient): k = 2 gives the best clustering quality; for larger k the index stabilizes.

3. (Gap statistic): 50 uniform reference sets were generated (nboot = 50); the gap values from the 10 samples were averaged for each cluster number and replotted.

\(\quad\) Ranked by gap statistic, from highest to lowest:

  1. k = 1
  2. k = { 29, 30, 26 }
  3. k = 3
  4. k = {17, …, 28} \ {26}
  5. k = {14 , 6, 5 }

Based on the gap statistic and practical considerations, the recommended development order is:

k = 1 \(\rightarrow\) k = 3 \(\rightarrow\) k = 5 \(\rightarrow\) k = 14

kmeans k=5

# kmeans k=5
k <- 5
set.seed(12)
kmeans_result5 <- kmeans(mean_geo[,1:2], centers=k)
mean_geo$cluster <- as.factor(kmeans_result5$cluster)
mean_geo_sf5 <- st_as_sf(mean_geo, coords = c("mean_lng", "mean_lat"), crs = 4326)
centers_df5 <- as.data.frame(kmeans_result5$centers)
centers_df5$cluster <- factor(1:nrow(centers_df5))  # add cluster labels
centers_df5
##    mean_lat  mean_lng cluster
## 1 -25.77325 -49.60844       1
## 2 -23.17350 -47.28295       2
## 3 -21.88553 -44.90911       3
## 4 -14.90212 -42.89579       4
## 5 -19.03393 -49.18551       5
ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +  
  geom_sf(data = mean_geo_sf5, aes(color = cluster), size = 1) + 
  geom_point(data = as.data.frame(kmeans_result5$centers), aes(x = mean_lng, y = mean_lat), 
             color = 'red', size = 5, shape = 4) + 
  geom_text(data = centers_df5, 
            aes(x = mean_lng, y = mean_lat,
                label = paste("(",round(mean_lng, 2),",", round(mean_lat, 2), ")",
                              sep = "")), 
            vjust = -1, hjust = 0.5, color = "black", size = 3) + 
  labs(title = paste("K-means cluster (K=", k, ")", sep=""), x = "Longitude", y = "Latitude") + 
  theme_minimal()
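Beyond the centers, the fitted `kmeans` object also carries the cluster sizes and within-cluster sums of squares, which connect back to the \(w_k\) used in the diagnostics above:

```r
kmeans_result5$size          # number of midpoints in each of the 5 clusters
kmeans_result5$withinss      # within-cluster sum of squares per cluster
kmeans_result5$tot.withinss  # total WCSS, the quantity k-means minimizes
```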

kmeans k=14

# kmeans k=14
set.seed(12)
k <- 14
kmeans_result14 <- kmeans(mean_geo[,1:2], centers=k)
mean_geo$cluster <- as.factor(kmeans_result14$cluster)
mean_geo_sf14 <- st_as_sf(mean_geo, coords = c("mean_lng", "mean_lat"), crs = 4326)
centers_df14 <- as.data.frame(kmeans_result14$centers)
centers_df14$cluster <- factor(1:nrow(centers_df14))  # add cluster labels
centers_df14
##     mean_lat  mean_lng cluster
## 1  -24.89890 -47.93623       1
## 2  -21.42944 -43.60574       2
## 3  -17.51547 -42.91844       3
## 4  -21.50916 -45.68617       4
## 5  -19.36181 -47.72873       5
## 6  -12.90147 -46.64863       6
## 7  -13.94465 -41.22025       7
## 8  -27.31003 -51.30935       8
## 9  -23.15399 -49.97409       9
## 10 -22.32487 -47.83565      10
## 11 -17.04051 -53.11055      11
## 12 -23.01799 -44.95821      12
## 13 -26.18626 -49.37412      13
## 14 -23.44512 -46.66692      14
ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +  
  geom_sf(data = mean_geo_sf14, aes(color = cluster), size = 1) + 
  geom_point(data = as.data.frame(kmeans_result14$centers), 
             aes(x = mean_lng, y = mean_lat), 
             color = 'red', size = 5, shape = 4) + 
  geom_text(data = centers_df14, 
            aes(x = mean_lng, y = mean_lat,
                label = paste("(",round(mean_lng, 2),",", round(mean_lat, 2), ")",
                              sep = "")), 
            vjust = ifelse(centers_df14$cluster %in%c(1,10,13,14),1.2,-1), 
            hjust = ifelse(centers_df14$cluster %in%c(1,2,10,12,13,14),-0.3,0.5), 
            color = "black", size = 3) + 
  labs(title = paste("K-means cluster (K=", k, ")", sep=""), x = "Longitude", y = "Latitude") + 
  theme_minimal()

K-means cluster in different K

As the number of centers K grows from 1 to 5, the new centers overlap with or sit near the previous ones, so the network can be expanded step by step from the existing locations.

# k=1 & 3 & 5 
ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +  
  geom_point(data = as.data.frame(kmeans_result1$centers), 
             aes(x = mean_lng, y = mean_lat, color = "cluster 1"), size = 5, shape = 16) + 
  geom_point(data = as.data.frame(kmeans_result3$centers), 
             aes(x = mean_lng, y = mean_lat, color = "cluster 3"), size = 3, shape = 17)+
  geom_point(data = as.data.frame(kmeans_result5$centers), 
             aes(x = mean_lng, y = mean_lat, color = "cluster 5"), size = 3, shape = 15)+
  scale_color_manual(values = c("cluster 1" = "red", "cluster 3" ="#FF3399","cluster 5"="#FF9933"),
                     breaks = c("cluster 1", "cluster 3", "cluster 5")) +
  labs(title = paste("K-means cluster in different K", sep=""),
       x = "Longitude", y = "Latitude",color = "number of K") + 
  theme_minimal()

When the number of centers K grows from 5 to 14:

  1. The original cluster-1 center still coincides with one of the 14 centers, so it can be regarded as the initial warehouse/logistics hub.

  2. Most of the old centers split into two or three new ones; the old locations are no longer the optimal center points.

# k=1 & 3 & 5 & 14
ggplot() +
  geom_sf(data = brazil_states, fill = "white", color = "black") +  
  geom_point(data = as.data.frame(kmeans_result1$centers), 
             aes(x = mean_lng, y = mean_lat, color = "cluster 1"), size = 5, shape = 16) + 
  geom_point(data = as.data.frame(kmeans_result3$centers), 
             aes(x = mean_lng, y = mean_lat, color = "cluster 3"), size = 3, shape = 17)+
  geom_point(data = as.data.frame(kmeans_result5$centers), 
             aes(x = mean_lng, y = mean_lat, color = "cluster 5"), size = 3, shape = 15)+
  geom_point(data = as.data.frame(kmeans_result14$centers),
             aes(x = mean_lng, y = mean_lat, color = "cluster 14"), size = 3, shape = 8, stroke = 1.1)+
  scale_color_manual(values = c("cluster 1" = "red", "cluster 3" ="#FF3399",
                                "cluster 5"="#FF9933","cluster 14"="#66CC00"),
                     breaks = c("cluster 1", "cluster 3", "cluster 5", "cluster 14")) +
  labs(title = paste("K-means cluster in different K", sep=""),
       x = "Longitude", y = "Latitude", color = "number of K") + 
  theme_minimal()